
Fine-Tuning LLMs

When and how to fine-tune with LoRA, QLoRA, PEFT, dataset preparation, and deployment considerations


Fine-tuning adapts a pre-trained LLM to your specific task or domain by training it further on your own data. But fine-tuning is not always the right answer. This lesson covers the decision framework, parameter-efficient methods like LoRA, dataset preparation, and practical training with Hugging Face.

The Decision Framework

Before fine-tuning, ask: (1) Can prompt engineering solve this? (cheapest, fastest) (2) Can RAG solve this? (best for knowledge/facts) (3) Do you need to change the model's behavior, style, or format consistently? Then fine-tune. Fine-tuning is best for: custom output formats, domain-specific tone/style, consistent behavioral changes, and tasks where prompting falls short.

When to Use Each Approach

| Approach | Best For | Cost | Speed to Deploy |
|---|---|---|---|
| Prompt Engineering | Clear tasks, standard formats | Free | Minutes |
| RAG | Knowledge grounding, private data, changing info | Low | Hours |
| Fine-Tuning | Custom behavior, style, format, specialized tasks | Medium-High | Days |
| Full Training | Entirely new capabilities, new languages | Very High | Weeks-Months |

Full Fine-Tuning vs Parameter-Efficient Methods

Full Fine-Tuning

Updates all model parameters. For a 7B-parameter model, this requires:

  • ~28 GB of GPU memory for the weights alone in fp32 (~14 GB in fp16/bf16); gradients, optimizer states, and activations can multiply that several times over
  • Large datasets (10,000+ examples)

Full fine-tuning is rarely practical for most teams.

    Parameter-Efficient Fine-Tuning (PEFT)

    Only updates a small subset of parameters while freezing the rest. This dramatically reduces memory, compute, and data requirements.

    LoRA: Low-Rank Adaptation

    LoRA is the most popular PEFT method. Instead of updating the full weight matrix W, LoRA adds two small matrices (A and B) that approximate the weight update.

    How LoRA Works

    Original: y = Wx
    LoRA:     y = Wx + BAx

Where:

  • W is the frozen original weight matrix (e.g., 4096 x 4096)
  • A is a small matrix (r x 4096), initialized randomly
  • B is a small matrix (4096 x r), initialized to zeros
  • r is the "rank" (typically 8-64), with r << 4096

    Instead of updating 16.7 million parameters (4096 x 4096), LoRA only trains 524,288 parameters (4096 x 64 + 64 x 4096 with rank=64). That is a 97% reduction in trainable parameters.
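The update rule above can be sketched in a few lines of PyTorch. This is a minimal illustration with the dimensions from the example, not the PEFT library's actual implementation; the `alpha / r` scaling follows the convention from the LoRA paper:

```python
import torch

d, r = 4096, 64
alpha = 128                      # common convention: alpha = 2 * r

W = torch.randn(d, d)            # frozen base weight (never updated)
A = torch.randn(r, d) * 0.01     # trainable, initialized randomly
B = torch.zeros(d, r)            # trainable, initialized to zeros

def lora_forward(x):
    # y = Wx + (alpha / r) * BAx -- only A and B receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = torch.randn(d)
# Because B starts at zero, the adapted model initially matches the base model
assert torch.allclose(lora_forward(x), W @ x)

print(A.numel() + B.numel())     # 524288 trainable, vs 16777216 in W
```

Initializing B to zeros is what makes training stable: the adapter contributes nothing at step 0, so fine-tuning starts exactly from the pre-trained model's behavior.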

    Where LoRA Is Applied

    LoRA adapters are typically added to the attention weight matrices in each transformer layer:

  • Query projection (q_proj)
  • Key projection (k_proj)
  • Value projection (v_proj)
  • Output projection (o_proj)

LoRA Rank Selection

| Rank | Trainable Params | Use Case |
|---|---|---|
| 4-8 | Very few | Simple style/format changes |
| 16-32 | Moderate | Domain adaptation, classification |
| 64-128 | More | Complex behavior changes |
| 256+ | Many | Approaching full fine-tuning |

Higher rank = more capacity, but also more memory and a greater risk of overfitting on small datasets.

    Start with Rank 16

    For most fine-tuning tasks, start with LoRA rank=16 and alpha=32 (alpha is typically 2x rank). This provides a good balance of capacity and efficiency. Only increase rank if your evaluation metrics plateau.

    QLoRA: Quantized LoRA

    QLoRA combines LoRA with 4-bit quantization of the base model. The frozen base weights are stored in 4-bit precision (NormalFloat4), while the LoRA adapters are trained in 16-bit.

    Memory Savings

| Method | 7B Model Memory |
|---|---|
| Full fine-tuning (fp32) | ~28 GB |
| Full fine-tuning (fp16) | ~14 GB |
| LoRA (fp16 base) | ~14 GB |
| QLoRA (4-bit base) | ~4-6 GB |

QLoRA makes it possible to fine-tune a 7B model on a single consumer GPU with 8 GB of VRAM.
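The weight rows in the table follow from simple bytes-per-parameter arithmetic (weights only; activations, gradients, and optimizer states add more on top):

```python
# Weight memory for a 7B-parameter model at different precisions (weights only)
params = 7e9

fp32 = params * 4 / 1e9    # 4 bytes/param
fp16 = params * 2 / 1e9    # 2 bytes/param
int4 = params * 0.5 / 1e9  # 0.5 bytes/param, before quantization overhead

print(fp32, fp16, int4)    # 28.0 14.0 3.5
```

The 4-bit figure lands below the table's ~4-6 GB range because the table also accounts for quantization constants and runtime overhead.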

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4
    bnb_4bit_compute_dtype="bfloat16",  # compute in bf16
    bnb_4bit_use_double_quant=True,     # double quantization
)

    Dataset Preparation

    Fine-tuning data must be formatted in the structure the model expects.

    Instruction Format (Alpaca-style)

    {
      "instruction": "Summarize the following article.",
      "input": "The Federal Reserve announced today that...",
      "output": "The Fed raised interest rates by 0.25% citing..."
    }
    

    Chat Format (ChatML / Llama)

    {
      "messages": [
        {"role": "system", "content": "You are a medical assistant."},
        {"role": "user", "content": "What are the symptoms of flu?"},
        {"role": "assistant", "content": "Common flu symptoms include fever..."}
      ]
    }
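The two formats are easy to convert between. Here is a small converter from Alpaca-style records to the chat format; the function name `alpaca_to_chat` and the default system prompt are illustrative choices, not a standard API:

```python
def alpaca_to_chat(example, system_prompt="You are a helpful assistant."):
    """Convert one Alpaca-style record to the chat/messages format."""
    user = example["instruction"]
    if example.get("input"):
        user += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Summarize the following article.",
    "input": "The Federal Reserve announced today that...",
    "output": "The Fed raised interest rates by 0.25% citing...",
}
chat = alpaca_to_chat(record)
```

Whichever format you pick, use it consistently: the model learns the template along with the content, so mixing templates in one training set degrades results.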
    

    Dataset Quality Guidelines

  • Minimum: 100-500 high-quality examples for LoRA
  • Recommended: 1,000-10,000 examples for robust fine-tuning
  • Diversity: Cover all expected input types and edge cases
  • Quality over quantity: 500 excellent examples beat 5,000 noisy ones
  • Deduplication: Remove near-duplicates to prevent memorization
  • Validation split: Hold out 10-20% for evaluation
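The deduplication and split steps can be sketched in plain Python over a list of example dicts. This only catches exact (case-insensitive) duplicates; near-duplicate detection (e.g., MinHash) is beyond this sketch, and the function name is illustrative:

```python
import random

def dedup_and_split(examples, holdout=0.1, seed=42):
    """Drop exact duplicates (case-insensitive), then hold out a validation set."""
    seen, unique = set(), []
    for ex in examples:
        key = ex["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)
    rng.shuffle(unique)                          # shuffle before splitting
    n_eval = max(1, int(len(unique) * holdout))  # always hold out at least one
    return unique[n_eval:], unique[:n_eval]      # (train, validation)

data = [{"text": "Hello"}, {"text": "hello "}, {"text": "World"}, {"text": "Flu facts"}]
train, val = dedup_and_split(data, holdout=0.25)
```

Shuffling with a fixed seed keeps the split reproducible across runs, which matters when comparing checkpoints.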
    # Complete LoRA fine-tuning with PEFT and Hugging Face
    from peft import (
        LoraConfig,
        get_peft_model,
        prepare_model_for_kbit_training,
        TaskType,
    )
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        TrainingArguments,
        Trainer,
        DataCollatorForLanguageModeling,
        BitsAndBytesConfig,
    )
    from datasets import load_dataset
    import torch

    # --- Step 1: Load base model with 4-bit quantization (QLoRA) ---
    model_name = "meta-llama/Llama-2-7b-hf"  # or any HF causal LM

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)  # gradient checkpointing, etc.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # --- Step 2: Configure LoRA ---
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,                # LoRA rank
        lora_alpha=32,       # scaling factor (typically 2x rank)
        lora_dropout=0.05,   # dropout for regularization
        target_modules=["q_proj", "v_proj"],  # which layers to adapt
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # e.g.: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.062

    # --- Step 3: Prepare and tokenize the dataset ---
    def format_instruction(example):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Input:\n{example.get('input', '')}\n\n"
                    f"### Response:\n{example['output']}"
        }

    def tokenize(example):
        return tokenizer(example["text"], truncation=True, max_length=512)

    dataset = load_dataset("tatsu-lab/alpaca", split="train")
    dataset = dataset.map(format_instruction)
    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

    # --- Step 4: Training ---
    training_args = TrainingArguments(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,  # match the bf16 compute dtype above
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    )

    trainer.train()

    # --- Step 5: Save the LoRA adapter (small file!) ---
    model.save_pretrained("./my-lora-adapter")
    # The adapter is only ~10-50 MB, not the full model

    Cost Considerations

    Fine-tuning costs include: (1) GPU compute for training (A100 40GB: ~$1-3/hour on cloud), (2) Storage for checkpoints, (3) Inference cost for serving fine-tuned models (you must host the full base model + adapter). For many use cases, a well-crafted prompt or RAG pipeline is more cost-effective. Only fine-tune when the quality improvement justifies the ongoing infrastructure cost.
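A quick back-of-envelope comparison (all figures below are illustrative assumptions, not quotes) shows why serving, not training, usually dominates the budget:

```python
# Illustrative cost comparison for a QLoRA fine-tune (assumed figures)
gpu_rate = 2.0           # $/hour for one A100-class GPU (mid-range cloud rate)
train_hours = 6          # an assumed short QLoRA run on a few thousand examples
training_cost = gpu_rate * train_hours

hours_per_month = 730
monthly_serving = gpu_rate * hours_per_month  # one always-on dedicated GPU

print(training_cost, monthly_serving)  # 12.0 1460.0
```

The one-time training run is cheap; the recurring cost of keeping a GPU warm to serve the model is what the prompt-vs-RAG-vs-fine-tune decision should be weighed against.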

    Evaluating Fine-Tuned Models

    Perplexity

    Perplexity measures how "surprised" the model is by the test data. Lower = better.

    import math
    import torch

    def compute_perplexity(model, tokenizer, text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
        return math.exp(outputs.loss.item())

    Task-Specific Metrics

| Task | Metrics |
|---|---|
| Classification | Accuracy, F1, Precision, Recall |
| Generation | BLEU, ROUGE, human preference |
| Instruction-following | Win rate vs base model (human eval) |
| Code generation | pass@k (functional correctness) |
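For code generation, pass@k is typically computed with the unbiased estimator from the HumanEval/Codex work: generate n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one of k sampled completions is correct)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Averaging this per-problem estimate over the benchmark gives the reported pass@k; computing it naively as (c/n)**k-style sampling instead introduces bias.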

    A/B Testing

    The gold standard: run both the base model and fine-tuned model on the same inputs, then have humans (or a strong judge model) rate which is better. Report the win rate of the fine-tuned model.
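Scoring the verdicts is straightforward; a common convention (assumed here) is to count ties as half a win for each side:

```python
def win_rate(verdicts):
    """verdicts: one of 'ft', 'base', or 'tie' per comparison.

    Ties count as half a win, so 0.5 means the models are indistinguishable.
    """
    wins = verdicts.count("ft")
    ties = verdicts.count("tie")
    return (wins + 0.5 * ties) / len(verdicts)

print(win_rate(["ft", "ft", "base", "tie"]))  # 0.625
```

Randomize which model's answer appears first in each pairing, since both human raters and judge models show position bias.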

    Deploying Fine-Tuned Models

    1. Merge adapter into base model (for simpler deployment):

       merged_model = model.merge_and_unload()
       merged_model.save_pretrained("./merged-model")
       

2. Serve with adapter (for multi-tenant setups running multiple adapters on one base model):
   - Load the base model once
   - Hot-swap LoRA adapters per request
   - Tools: vLLM, LoRAX, Hugging Face TGI

3. Quantize for deployment (GGUF/GPTQ for smaller, faster inference)