
Fine-Tuning LLMs

When and how to fine-tune with LoRA, QLoRA, PEFT, dataset preparation, and deployment considerations


Fine-tuning adapts a pre-trained LLM to your specific task or domain by training it further on your own data. But fine-tuning is not always the right answer. This lesson covers the decision framework, parameter-efficient methods like LoRA, dataset preparation, and practical training with Hugging Face.

The Decision Framework

Before fine-tuning, ask: (1) Can prompt engineering solve this? (cheapest, fastest) (2) Can RAG solve this? (best for knowledge/facts) (3) Do you need to change the model's behavior, style, or format consistently? Then fine-tune. Fine-tuning is best for: custom output formats, domain-specific tone/style, consistent behavioral changes, and tasks where prompting falls short.

When to Use Each Approach

| Approach | Best For | Cost | Speed to Deploy |
|---|---|---|---|
| Prompt Engineering | Clear tasks, standard formats | Free | Minutes |
| RAG | Knowledge grounding, private data, changing info | Low | Hours |
| Fine-Tuning | Custom behavior, style, format, specialized tasks | Medium-High | Days |
| Full Training | Entirely new capabilities, new languages | Very High | Weeks-Months |

Full Fine-Tuning vs Parameter-Efficient Methods

Full Fine-Tuning

Updates all model parameters. For a 7B-parameter model, this requires:

  • ~28 GB of GPU memory for the weights alone in fp32 (~14 GB in fp16/bf16); gradients, optimizer states, and activations can multiply that several times over
  • Large datasets (10,000+ examples)

Full fine-tuning is rarely practical for most teams.

    Parameter-Efficient Fine-Tuning (PEFT)

    Only updates a small subset of parameters while freezing the rest. This dramatically reduces memory, compute, and data requirements.

    LoRA: Low-Rank Adaptation

    LoRA is the most popular PEFT method. Instead of updating the full weight matrix W, LoRA adds two small matrices (A and B) that approximate the weight update.

    How LoRA Works

    Original: y = Wx
    LoRA:     y = Wx + BAx

Where:

  • W is the frozen original weight matrix (e.g., 4096 x 4096)
  • A is a small matrix (r x 4096), initialized randomly
  • B is a small matrix (4096 x r), initialized to zeros
  • r is the "rank" (typically 8-64), with r << 4096

    Instead of updating 16.7 million parameters (4096 x 4096), LoRA only trains 524,288 parameters (4096 x 64 + 64 x 4096 with rank=64). That is a 97% reduction in trainable parameters.
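The update rule above can be sketched in a few lines of PyTorch. This is a minimal illustration with the dimensions from the example, not the PEFT library's actual implementation; the `alpha / r` scaling follows the convention from the LoRA paper:

```python
import torch

d, r = 4096, 64
alpha = 128                      # common convention: alpha = 2 * r

W = torch.randn(d, d)            # frozen base weight (never updated)
A = torch.randn(r, d) * 0.01     # trainable, initialized randomly
B = torch.zeros(d, r)            # trainable, initialized to zeros

def lora_forward(x):
    # y = Wx + (alpha / r) * BAx -- only A and B receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = torch.randn(d)
# Because B starts at zero, the adapted model initially matches the base model
assert torch.allclose(lora_forward(x), W @ x)

print(A.numel() + B.numel())     # 524288 trainable, vs 16777216 in W
```

Initializing B to zeros is what makes training stable: the adapter contributes nothing at step 0, so fine-tuning starts exactly from the pre-trained model's behavior.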

    Where LoRA Is Applied

    LoRA adapters are typically added to the attention weight matrices in each transformer layer:

  • Query projection (q_proj)
  • Key projection (k_proj)
  • Value projection (v_proj)
  • Output projection (o_proj)

LoRA Rank Selection

| Rank | Trainable Params | Use Case |
|---|---|---|
| 4-8 | Very few | Simple style/format changes |
| 16-32 | Moderate | Domain adaptation, classification |
| 64-128 | More | Complex behavior changes |
| 256+ | Many | Approaching full fine-tuning |

Higher rank = more capacity, but also more memory and a greater risk of overfitting on small datasets.

    Start with Rank 16

    For most fine-tuning tasks, start with LoRA rank=16 and alpha=32 (alpha is typically 2x rank). This provides a good balance of capacity and efficiency. Only increase rank if your evaluation metrics plateau.

    QLoRA: Quantized LoRA

    QLoRA combines LoRA with 4-bit quantization of the base model. The frozen base weights are stored in 4-bit precision (NormalFloat4), while the LoRA adapters are trained in 16-bit.

    Memory Savings

| Method | 7B Model Memory |
|---|---|
| Full fine-tuning (fp32) | ~28 GB |
| Full fine-tuning (fp16) | ~14 GB |
| LoRA (fp16 base) | ~14 GB |
| QLoRA (4-bit base) | ~4-6 GB |

QLoRA makes it possible to fine-tune a 7B model on a single consumer GPU with 8 GB of VRAM.
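The weight rows in the table follow from simple bytes-per-parameter arithmetic (weights only; activations, gradients, and optimizer states add more on top):

```python
# Weight memory for a 7B-parameter model at different precisions (weights only)
params = 7e9

fp32 = params * 4 / 1e9    # 4 bytes/param
fp16 = params * 2 / 1e9    # 2 bytes/param
int4 = params * 0.5 / 1e9  # 0.5 bytes/param, before quantization overhead

print(fp32, fp16, int4)    # 28.0 14.0 3.5
```

The 4-bit figure lands below the table's ~4-6 GB range because the table also accounts for quantization constants and runtime overhead.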

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4
    bnb_4bit_compute_dtype="bfloat16",  # compute in bf16
    bnb_4bit_use_double_quant=True,     # double quantization
)

    Dataset Preparation

    Fine-tuning data must be formatted in the structure the model expects.

    Instruction Format (Alpaca-style)

    {
      "instruction": "Summarize the following article.",
      "input": "The Federal Reserve announced today that...",
      "output": "The Fed raised interest rates by 0.25% citing..."
    }
    

    Chat Format (ChatML / Llama)

    {
      "messages": [
        {"role": "system", "content": "You are a medical assistant."},
        {"role": "user", "content": "What are the symptoms of flu?"},
        {"role": "assistant", "content": "Common flu symptoms include fever..."}
      ]
    }
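The two formats are easy to convert between. Here is a small converter from Alpaca-style records to the chat format; the function name `alpaca_to_chat` and the default system prompt are illustrative choices, not a standard API:

```python
def alpaca_to_chat(example, system_prompt="You are a helpful assistant."):
    """Convert one Alpaca-style record to the chat/messages format."""
    user = example["instruction"]
    if example.get("input"):
        user += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Summarize the following article.",
    "input": "The Federal Reserve announced today that...",
    "output": "The Fed raised interest rates by 0.25% citing...",
}
chat = alpaca_to_chat(record)
```

Whichever format you pick, use it consistently: the model learns the template along with the content, so mixing templates in one training set degrades results.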
    

    Dataset Quality Guidelines

  • Minimum: 100-500 high-quality examples for LoRA
  • Recommended: 1,000-10,000 examples for robust fine-tuning
  • Diversity: Cover all expected input types and edge cases
  • Quality over quantity: 500 excellent examples beat 5,000 noisy ones
  • Deduplication: Remove near-duplicates to prevent memorization
  • Validation split: Hold out 10-20% for evaluation
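The deduplication and split steps can be sketched in plain Python over a list of example dicts. This only catches exact (case-insensitive) duplicates; near-duplicate detection (e.g., MinHash) is beyond this sketch, and the function name is illustrative:

```python
import random

def dedup_and_split(examples, holdout=0.1, seed=42):
    """Drop exact duplicates (case-insensitive), then hold out a validation set."""
    seen, unique = set(), []
    for ex in examples:
        key = ex["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)
    rng.shuffle(unique)                          # shuffle before splitting
    n_eval = max(1, int(len(unique) * holdout))  # always hold out at least one
    return unique[n_eval:], unique[:n_eval]      # (train, validation)

data = [{"text": "Hello"}, {"text": "hello "}, {"text": "World"}, {"text": "Flu facts"}]
train, val = dedup_and_split(data, holdout=0.25)
```

Shuffling with a fixed seed keeps the split reproducible across runs, which matters when comparing checkpoints.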
    # Complete LoRA fine-tuning with PEFT and Hugging Face
    from peft import (
        LoraConfig,
        get_peft_model,
        prepare_model_for_kbit_training,
        TaskType,
    )
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        TrainingArguments,
        Trainer,
        DataCollatorForLanguageModeling,
        BitsAndBytesConfig,
    )
    from datasets import load_dataset
    import torch

    # --- Step 1: Load base model with 4-bit quantization (QLoRA) ---
    model_name = "meta-llama/Llama-2-7b-hf"  # or any HF causal LM

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)  # gradient checkpointing, etc.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # --- Step 2: Configure LoRA ---
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,                # LoRA rank
        lora_alpha=32,       # scaling factor (typically 2x rank)
        lora_dropout=0.05,   # dropout for regularization
        target_modules=["q_proj", "v_proj"],  # which layers to adapt
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # e.g.: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.062

    # --- Step 3: Prepare and tokenize the dataset ---
    def format_instruction(example):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Input:\n{example.get('input', '')}\n\n"
                    f"### Response:\n{example['output']}"
        }

    def tokenize(example):
        return tokenizer(example["text"], truncation=True, max_length=512)

    dataset = load_dataset("tatsu-lab/alpaca", split="train")
    dataset = dataset.map(format_instruction)
    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

    # --- Step 4: Training ---
    training_args = TrainingArguments(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,  # match the bf16 compute dtype above
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    )

    trainer.train()

    # --- Step 5: Save the LoRA adapter (small file!) ---
    model.save_pretrained("./my-lora-adapter")
    # The adapter is only ~10-50 MB, not the full model

    Cost Considerations

    Fine-tuning costs include: (1) GPU compute for training (A100 40GB: ~$1-3/hour on cloud), (2) Storage for checkpoints, (3) Inference cost for serving fine-tuned models (you must host the full base model + adapter). For many use cases, a well-crafted prompt or RAG pipeline is more cost-effective. Only fine-tune when the quality improvement justifies the ongoing infrastructure cost.
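A quick back-of-envelope comparison (all figures below are illustrative assumptions, not quotes) shows why serving, not training, usually dominates the budget:

```python
# Illustrative cost comparison for a QLoRA fine-tune (assumed figures)
gpu_rate = 2.0           # $/hour for one A100-class GPU (mid-range cloud rate)
train_hours = 6          # an assumed short QLoRA run on a few thousand examples
training_cost = gpu_rate * train_hours

hours_per_month = 730
monthly_serving = gpu_rate * hours_per_month  # one always-on dedicated GPU

print(training_cost, monthly_serving)  # 12.0 1460.0
```

The one-time training run is cheap; the recurring cost of keeping a GPU warm to serve the model is what the prompt-vs-RAG-vs-fine-tune decision should be weighed against.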

    Evaluating Fine-Tuned Models

    Perplexity

    Perplexity measures how "surprised" the model is by the test data. Lower = better.

    import math
    import torch

    def compute_perplexity(model, tokenizer, text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
        return math.exp(outputs.loss.item())

    Task-Specific Metrics

| Task | Metrics |
|---|---|
| Classification | Accuracy, F1, Precision, Recall |
| Generation | BLEU, ROUGE, human preference |
| Instruction-following | Win rate vs base model (human eval) |
| Code generation | pass@k (functional correctness) |
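For code generation, pass@k is typically computed with the unbiased estimator from the HumanEval/Codex work: generate n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one of k sampled completions is correct)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Averaging this per-problem estimate over the benchmark gives the reported pass@k; computing it naively as (c/n)**k-style sampling instead introduces bias.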

    A/B Testing

    The gold standard: run both the base model and fine-tuned model on the same inputs, then have humans (or a strong judge model) rate which is better. Report the win rate of the fine-tuned model.
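Scoring the verdicts is straightforward; a common convention (assumed here) is to count ties as half a win for each side:

```python
def win_rate(verdicts):
    """verdicts: one of 'ft', 'base', or 'tie' per comparison.

    Ties count as half a win, so 0.5 means the models are indistinguishable.
    """
    wins = verdicts.count("ft")
    ties = verdicts.count("tie")
    return (wins + 0.5 * ties) / len(verdicts)

print(win_rate(["ft", "ft", "base", "tie"]))  # 0.625
```

Randomize which model's answer appears first in each pairing, since both human raters and judge models show position bias.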

    Deploying Fine-Tuned Models

    1. Merge adapter into base model (for simpler deployment):

       merged_model = model.merge_and_unload()
       merged_model.save_pretrained("./merged-model")
       

2. Serve with adapter (for multi-tenant setups running multiple adapters on one base model):
   - Load the base model once
   - Hot-swap LoRA adapters per request
   - Tools: vLLM, LoRAX, Hugging Face TGI

3. Quantize for deployment (GGUF/GPTQ for smaller, faster inference)