Fine-Tuning LLMs
Fine-tuning adapts a pre-trained LLM to your specific task or domain by training it further on your own data. But fine-tuning is not always the right answer. This lesson covers the decision framework, parameter-efficient methods like LoRA, dataset preparation, and practical training with Hugging Face.
The Decision Framework
When to Use Each Approach
| Approach | Best For | Cost | Speed to Deploy |
|---|---|---|---|
| Prompt Engineering | Clear tasks, standard formats | Free | Minutes |
| RAG | Knowledge grounding, private data, changing info | Low | Hours |
| Fine-Tuning | Custom behavior, style, format, specialized tasks | Medium-High | Days |
| Full Training | Entirely new capabilities, new languages | Very High | Weeks-Months |
Full Fine-Tuning vs Parameter-Efficient Methods
Full Fine-Tuning
Updates all model parameters. For a 7B-parameter model this means roughly 28 GB just for the fp32 weights, plus comparable memory again for gradients and optimizer states (Adam keeps two extra values per parameter), so training typically needs well over 100 GB of GPU memory spread across multiple high-end GPUs.
Full fine-tuning is rarely practical for most teams.
Parameter-Efficient Fine-Tuning (PEFT)
Only updates a small subset of parameters while freezing the rest. This dramatically reduces memory, compute, and data requirements.
LoRA: Low-Rank Adaptation
LoRA is the most popular PEFT method. Instead of updating the full weight matrix W, LoRA adds two small matrices (A and B) that approximate the weight update.
How LoRA Works
Original: y = Wx
LoRA: y = Wx + BAx

Where:
W is the frozen original weight matrix (e.g., 4096 x 4096)
A is a small matrix (4096 x r) - initialized randomly
B is a small matrix (r x 4096) - initialized to zeros
r is the "rank" (typically 8-64) << 4096
Instead of updating 16.7 million parameters (4096 x 4096), LoRA only trains 524,288 parameters (4096 x 64 + 64 x 4096 with rank=64). That is a 97% reduction in trainable parameters.
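The arithmetic above is easy to verify. A quick sketch, using the same 4096 x 4096 matrix and rank 64 from the example:

```python
# Parameter count for a LoRA adapter of rank r on a d x d weight matrix
d, r = 4096, 64

full = d * d           # updating W directly: 16,777,216 params
lora = d * r + r * d   # A (d x r) plus B (r x d): 524,288 params

print(full, lora)                 # 16777216 524288
print(round(1 - lora / full, 3))  # 0.969 -> ~97% fewer trainable params
```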
Where LoRA Is Applied
LoRA adapters are typically added to the attention weight matrices in each transformer layer: most commonly the query and value projections (q_proj, v_proj), and sometimes also the key and output projections or the MLP layers.
LoRA Rank Selection
| Rank | Trainable Params | Use Case |
|---|---|---|
| 4-8 | Very few | Simple style/format changes |
| 16-32 | Moderate | Domain adaptation, classification |
| 64-128 | More | Complex behavior changes |
| 256+ | Many | Approaching full fine-tuning |
Start with rank 16 for most tasks, and increase it only if the fine-tuned model underfits your data.
QLoRA: Quantized LoRA
QLoRA combines LoRA with 4-bit quantization of the base model. The frozen base weights are stored in 4-bit precision (NormalFloat4), while the LoRA adapters are trained in 16-bit.
Memory Savings
| Method | 7B Model Memory |
|---|---|
| Full fine-tuning (fp32) | ~28 GB |
| Full fine-tuning (fp16) | ~14 GB |
| LoRA (fp16 base) | ~14 GB |
| QLoRA (4-bit base) | ~4-6 GB |
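The table's figures roughly match a weights-only back-of-envelope estimate (real training needs additional memory for gradients, optimizer states, and activations):

```python
def weights_memory_gb(n_params: float, bytes_per_param: float) -> float:
    # Memory for the model weights alone, in gigabytes
    return n_params * bytes_per_param / 1e9

n = 7e9  # 7B parameters
print(weights_memory_gb(n, 4))    # fp32:  28.0 GB
print(weights_memory_gb(n, 2))    # fp16:  14.0 GB
print(weights_memory_gb(n, 0.5))  # 4-bit:  3.5 GB
```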
```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4
    bnb_4bit_compute_dtype="bfloat16",    # Compute in bf16
    bnb_4bit_use_double_quant=True,       # Double quantization
)
```
Dataset Preparation
Fine-tuning data must be formatted in the structure the model expects.
Instruction Format (Alpaca-style)
```json
{
  "instruction": "Summarize the following article.",
  "input": "The Federal Reserve announced today that...",
  "output": "The Fed raised interest rates by 0.25% citing..."
}
```
Chat Format (ChatML / Llama)
```json
{
  "messages": [
    {"role": "system", "content": "You are a medical assistant."},
    {"role": "user", "content": "What are the symptoms of flu?"},
    {"role": "assistant", "content": "Common flu symptoms include fever..."}
  ]
}
```
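In practice you would render these messages with your tokenizer's chat template; as an illustration, ChatML wraps each message in `<|im_start|>` / `<|im_end|>` markers. A minimal sketch (the `to_chatml` helper is hypothetical, not a library function):

```python
def to_chatml(messages):
    # Render a message list in ChatML markup: <|im_start|>role\ncontent<|im_end|>
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

messages = [
    {"role": "system", "content": "You are a medical assistant."},
    {"role": "user", "content": "What are the symptoms of flu?"},
]
print(to_chatml(messages))
```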
Dataset Quality Guidelines
- Quality beats quantity: a few hundred carefully reviewed examples often outperform thousands of noisy ones.
- Keep formatting consistent: every example should use the exact template the model will see at inference time.
- Deduplicate, and hold out a validation split so you can detect overfitting.
```python
# Complete LoRA Fine-Tuning with PEFT and Hugging Face
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from datasets import load_dataset
import torch

# --- Step 1: Load base model with 4-bit quantization (QLoRA) ---
model_name = "meta-llama/Llama-2-7b-hf"  # or any HF model

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # enable gradients through the quantized base
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# --- Step 2: Configure LoRA ---
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # Scaling factor (typically 2x rank)
    lora_dropout=0.05,                    # Dropout for regularization
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: 8,388,608 || all params: ~6.7B || trainable%: ~0.12

# --- Step 3: Prepare and tokenize the dataset ---
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example.get('input', '')}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = load_dataset("tatsu-lab/alpaca", split="train")
dataset = dataset.map(format_instruction)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,  # keep only the tokenized fields
)

# --- Step 4: Training ---
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,  # match the bf16 compute dtype used for quantization
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

# --- Step 5: Save LoRA adapter (small file!) ---
model.save_pretrained("./my-lora-adapter")
# The adapter is only ~10-50 MB, not the full model
```

Cost Considerations
Evaluating Fine-Tuned Models
Perplexity
Perplexity measures how "surprised" the model is by the test data. Lower = better.
```python
import math
import torch

def compute_perplexity(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())
```
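For intuition: perplexity is just the exponential of the mean per-token cross-entropy loss, so a loss of 2.0 corresponds to a perplexity of about 7.4. A worked example with hypothetical loss values:

```python
import math

# Per-token cross-entropy losses from a (hypothetical) evaluation run
losses = [2.1, 1.9, 2.0]
perplexity = math.exp(sum(losses) / len(losses))
print(round(perplexity, 2))  # 7.39
```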
Task-Specific Metrics
| Task | Metrics |
|---|---|
| Classification | Accuracy, F1, Precision, Recall |
| Generation | BLEU, ROUGE, human preference |
| Instruction-following | Win rate vs base model (human eval) |
| Code generation | pass@k (functional correctness) |
A/B Testing
The gold standard: run both the base model and fine-tuned model on the same inputs, then have humans (or a strong judge model) rate which is better. Report the win rate of the fine-tuned model.
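A minimal win-rate computation could look like the sketch below (counting ties as half a win is one common convention; the verdict labels are hypothetical):

```python
def win_rate(judgments):
    # judgments: list of per-example verdicts, "fine-tuned", "base", or "tie"
    wins = judgments.count("fine-tuned")
    ties = judgments.count("tie")
    return (wins + 0.5 * ties) / len(judgments)

verdicts = ["fine-tuned", "fine-tuned", "tie", "base", "fine-tuned"]
print(win_rate(verdicts))  # 0.7
```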
Deploying Fine-Tuned Models
1. Merge adapter into base model (for simpler deployment):
```python
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```
2. Serve with adapter (for multi-tenant setups with multiple adapters on one base model):
- Load the base model once
- Hot-swap LoRA adapters per request
- Tools: vLLM, LoRAX, Hugging Face TGI
3. Quantize for deployment (GGUF/GPTQ for smaller, faster inference)