Fine-tuning Large Language Models (LLMs) adapts them to specific tasks such as text generation, summarization, and question answering. Let’s walk through how to do this with simple examples and code snippets.
1. Text Generation
Text generation means producing coherent text that continues a given prompt. Fine-tuning models such as GPT-2 or GPT-3 helps them generate human-like text.
Example: Fine-tuning GPT-2 for Text Generation
We’ll use the transformers and datasets libraries from Hugging Face.
Code Example:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset and drop empty lines
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
dataset = dataset.filter(lambda example: len(example['text'].strip()) > 0)

# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Tokenize the dataset (padding is handled later by the data collator)
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# The collator pads each batch and copies input_ids into labels for causal LM training
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Set training parameters
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=2,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Generate text with the fine-tuned model
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device)
output = model.generate(input_ids, max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")
Output:
Generated text: Once upon a time, there was a little girl who lived in a village near the forest. She loved exploring the woods and finding new adventures...
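After training, it’s worth persisting the fine-tuned weights so you can reuse them without retraining. Here is a minimal sketch (the output directory name is just a placeholder) that saves the model and tokenizer and reloads them through the pipeline API:
from transformers import pipeline

# Save the fine-tuned model and tokenizer (directory name is a placeholder)
trainer.save_model('./gpt2-finetuned')
tokenizer.save_pretrained('./gpt2-finetuned')

# Reload them with the text-generation pipeline for convenient inference
generator = pipeline('text-generation', model='./gpt2-finetuned')
print(generator("Once upon a time", max_length=50, num_return_sequences=1)[0]['generated_text'])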
2. Summarization
Summarization condenses a long text into a short summary. Fine-tuning models such as BART or T5 helps produce accurate, fluent summaries.
Example: Fine-tuning BART for Summarization
We’ll use the transformers library.
Code Example:
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset
dataset = load_dataset('cnn_dailymail', '3.0.0', split='train')

# Initialize tokenizer and model
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Tokenize the dataset: articles become the inputs, highlights become the labels
def tokenize_function(examples):
    inputs = tokenizer(examples['article'], truncation=True, padding='max_length', max_length=512)
    targets = tokenizer(examples['highlights'], truncation=True, padding='max_length', max_length=128)
    # Replace padding token ids in the labels with -100 so the loss ignores them
    inputs['labels'] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in targets['input_ids']
    ]
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["article", "highlights"])

# Set training parameters
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=2,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

# Train the model
trainer.train()

# Summarize text with the fine-tuned model
input_text = "The quick brown fox jumps over the lazy dog. The dog barked and chased the fox into the forest."
inputs = tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True).to(model.device)
summary_ids = model.generate(inputs, max_length=50, min_length=25, length_penalty=2.0, num_beams=4)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary: {summary}")
Output:
Summary: The dog barked and chased the fox into the forest.
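To check whether fine-tuning actually improved the summaries, a common approach is to score generated summaries against reference summaries with ROUGE. A rough sketch using the Hugging Face evaluate library (it needs the rouge_score package installed; the reference text here is made up for illustration):
import evaluate

# Compare the generated summary against a reference summary (illustrative only)
rouge = evaluate.load('rouge')
reference = "The dog chased the fox into the forest."
scores = rouge.compute(predictions=[summary], references=[reference])
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}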
3. Question Answering
Question answering means extracting a precise answer to a question from a given passage. Fine-tuning models such as BERT or RoBERTa works well for this task.
Example: Fine-tuning BERT for Question Answering
We’ll use the transformers library again.
Code Example:
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset
dataset = load_dataset('squad')

# Initialize tokenizer and model (the fast tokenizer provides the offset mappings used below)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Tokenize the dataset and compute the start/end token positions of each answer,
# which BertForQuestionAnswering needs as training labels
def tokenize_function(examples):
    tokenized = tokenizer(
        examples['question'],
        examples['context'],
        truncation='only_second',
        padding='max_length',
        max_length=384,
        return_offsets_mapping=True,
    )
    start_positions, end_positions = [], []
    for i, answers in enumerate(examples['answers']):
        answer_start = answers['answer_start'][0]
        answer_end = answer_start + len(answers['text'][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Default to the [CLS] token (index 0) if the answer was truncated away
        start_token = end_token = 0
        for idx, (start, end) in enumerate(tokenized['offset_mapping'][i]):
            if sequence_ids[idx] != 1:  # skip everything that is not a context token
                continue
            if start <= answer_start < end:
                start_token = idx
            if start < answer_end <= end:
                end_token = idx
        start_positions.append(start_token)
        end_positions.append(end_token)
    tokenized['start_positions'] = start_positions
    tokenized['end_positions'] = end_positions
    tokenized.pop('offset_mapping')
    return tokenized

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["id", "title", "context", "question", "answers"])

# Set training parameters
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=2,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

# Train the model
trainer.train()

# Answer a question with the fine-tuned model
context = "The quick brown fox jumps over the lazy dog."
question = "What does the fox jump over?"
inputs = tokenizer(question, context, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
start_index = int(torch.argmax(outputs.start_logits))
end_index = int(torch.argmax(outputs.end_logits))
answer = tokenizer.decode(inputs['input_ids'][0][start_index:end_index + 1])
print(f"Question: {question}")
print(f"Answer: {answer}")
Output:
Question: What does the fox jump over?
Answer: the lazy dog
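The manual span extraction above (taking the argmax of the start and end logits) is what the question-answering pipeline does for you. Here is a minimal sketch of the same inference with the fine-tuned model:
from transformers import pipeline

# Wrap the fine-tuned model and tokenizer in a question-answering pipeline
qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)
result = qa_pipeline(question="What does the fox jump over?",
                     context="The quick brown fox jumps over the lazy dog.")
print(result['answer'], result['score'])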
Summary
- Text Generation: Creating meaningful text from a starting prompt by fine-tuning GPT-2.
- Example: Fine-tuning GPT-2 and generating text.
- Summarization: Creating short summaries of long texts by fine-tuning BART.
- Example: Fine-tuning BART and summarizing text.
- Question Answering: Providing answers to questions based on a given text by fine-tuning BERT.
- Example: Fine-tuning BERT and answering questions.
Fine-tuning LLMs for these tasks makes them better at specific jobs and more useful in real applications. Try out the code examples and adapt them to your own data and tasks.