Fine-tuning Large Language Models (LLMs) adapts them to specific tasks such as text generation, summarization, and question answering. Let’s walk through how to do this with simple examples and code snippets.
1. Text Generation
Text generation means producing coherent text that continues a given prompt. Fine-tuning models such as GPT-2 or GPT-3 helps them generate human-like text.
Example: Fine-tuning GPT-2 for Text Generation
We’ll use the transformers and datasets libraries from Hugging Face.
Code Example:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset and drop empty lines
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
dataset = dataset.filter(lambda example: len(example['text'].strip()) > 0)

# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Tokenize the dataset (padding is handled later by the data collator)
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# The collator pads each batch and copies input_ids into labels for causal LM training
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Set training parameters
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=2,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Generate text with the fine-tuned model
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device)
output = model.generate(input_ids, max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")
Output:
Generated text: Once upon a time, there was a little girl who lived in a village near the forest. She loved exploring the woods and finding new adventures...
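After training, it’s worth persisting the fine-tuned weights so you can reuse them without retraining. Here is a minimal sketch (the output directory name is just a placeholder) that saves the model and tokenizer and reloads them through the pipeline API:
from transformers import pipeline

# Save the fine-tuned model and tokenizer (directory name is a placeholder)
trainer.save_model('./gpt2-finetuned')
tokenizer.save_pretrained('./gpt2-finetuned')

# Reload them with the text-generation pipeline for convenient inference
generator = pipeline('text-generation', model='./gpt2-finetuned')
print(generator("Once upon a time", max_length=50, num_return_sequences=1)[0]['generated_text'])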
2. Summarization
Summarization condenses a long text into a short summary. Fine-tuning models such as BART or T5 helps produce accurate, fluent summaries.
Example: Fine-tuning BART for Summarization
We’ll use the transformers library.
Code Example:
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset
dataset = load_dataset('cnn_dailymail', '3.0.0', split='train')

# Initialize tokenizer and model
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Tokenize the dataset: articles become the inputs, highlights become the labels
def tokenize_function(examples):
    inputs = tokenizer(examples['article'], truncation=True, padding='max_length', max_length=512)
    targets = tokenizer(examples['highlights'], truncation=True, padding='max_length', max_length=128)
    # Replace padding token ids in the labels with -100 so the loss ignores them
    inputs['labels'] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in targets['input_ids']
    ]
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["article", "highlights"])

# Set training parameters
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=2,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

# Train the model
trainer.train()

# Summarize text with the fine-tuned model
input_text = "The quick brown fox jumps over the lazy dog. The dog barked and chased the fox into the forest."
inputs = tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True).to(model.device)
summary_ids = model.generate(inputs, max_length=50, min_length=25, length_penalty=2.0, num_beams=4)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary: {summary}")
Output:
Summary: The dog barked and chased the fox into the forest.
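To check whether fine-tuning actually improved the summaries, a common approach is to score generated summaries against reference summaries with ROUGE. A rough sketch using the Hugging Face evaluate library (it needs the rouge_score package installed; the reference text here is made up for illustration):
import evaluate

# Compare the generated summary against a reference summary (illustrative only)
rouge = evaluate.load('rouge')
reference = "The dog chased the fox into the forest."
scores = rouge.compute(predictions=[summary], references=[reference])
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}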
3. Question Answering
Question answering means extracting a precise answer to a question from a given passage. Fine-tuning models such as BERT or RoBERTa works well for this task.
Example: Fine-tuning BERT for Question Answering
We’ll use the transformers library again.
Code Example:
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset
dataset = load_dataset('squad')

# Initialize tokenizer and model (the fast tokenizer provides the offset mappings used below)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Tokenize the dataset and compute the start/end token positions of each answer,
# which BertForQuestionAnswering needs as training labels
def tokenize_function(examples):
    tokenized = tokenizer(
        examples['question'],
        examples['context'],
        truncation='only_second',
        padding='max_length',
        max_length=384,
        return_offsets_mapping=True,
    )
    start_positions, end_positions = [], []
    for i, answers in enumerate(examples['answers']):
        answer_start = answers['answer_start'][0]
        answer_end = answer_start + len(answers['text'][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Default to the [CLS] token (index 0) if the answer was truncated away
        start_token = end_token = 0
        for idx, (start, end) in enumerate(tokenized['offset_mapping'][i]):
            if sequence_ids[idx] != 1:  # skip everything that is not a context token
                continue
            if start <= answer_start < end:
                start_token = idx
            if start < answer_end <= end:
                end_token = idx
        start_positions.append(start_token)
        end_positions.append(end_token)
    tokenized['start_positions'] = start_positions
    tokenized['end_positions'] = end_positions
    tokenized.pop('offset_mapping')
    return tokenized

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["id", "title", "context", "question", "answers"])

# Set training parameters
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=2,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

# Train the model
trainer.train()

# Answer a question with the fine-tuned model
context = "The quick brown fox jumps over the lazy dog."
question = "What does the fox jump over?"
inputs = tokenizer(question, context, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
start_index = int(torch.argmax(outputs.start_logits))
end_index = int(torch.argmax(outputs.end_logits))
answer = tokenizer.decode(inputs['input_ids'][0][start_index:end_index + 1])
print(f"Question: {question}")
print(f"Answer: {answer}")
Output:
Question: What does the fox jump over?
Answer: the lazy dog
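The manual span extraction above (taking the argmax of the start and end logits) is what the question-answering pipeline does for you. Here is a minimal sketch of the same inference with the fine-tuned model:
from transformers import pipeline

# Wrap the fine-tuned model and tokenizer in a question-answering pipeline
qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)
result = qa_pipeline(question="What does the fox jump over?",
                     context="The quick brown fox jumps over the lazy dog.")
print(result['answer'], result['score'])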
Summary
- Text Generation: Creating meaningful text from a starting prompt by fine-tuning GPT-2.
- Example: Fine-tuning GPT-2 and generating text.
- Summarization: Creating short summaries of long texts by fine-tuning BART.
- Example: Fine-tuning BART and summarizing text.
- Question Answering: Providing answers to questions based on a given text by fine-tuning BERT.
- Example: Fine-tuning BERT and answering questions.
Fine-tuning LLMs for these tasks makes them better at specific jobs and more useful in real applications. Try out the code examples and adapt them to your own data and tasks.