
Parameter-Efficient Fine-Tuning

LoRA, QLoRA, prefix tuning, prompt tuning, adapters, and the PEFT library


Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning updates all parameters in a model. For a 7B parameter model, that means storing and updating 7 billion floats — requiring massive GPU memory and producing a full copy of the model for each task. PEFT methods solve this by training only a small fraction of parameters.

Why PEFT?

| Approach | Trainable Params | GPU Memory | Storage per Task |
| --- | --- | --- | --- |
| Full Fine-Tuning (7B) | 7,000,000,000 | ~28 GB+ | ~14 GB |
| LoRA (rank 16) | ~4,000,000 | ~16 GB | ~16 MB |
| QLoRA (4-bit + LoRA) | ~4,000,000 | ~6 GB | ~16 MB |
PEFT methods train 0.1% or less of the total parameters while achieving performance comparable to full fine-tuning.

The Core Insight of PEFT

Pre-trained models already encode vast knowledge. Fine-tuning only needs to make small adjustments. PEFT methods exploit this by learning a small set of new parameters that modify the model's behavior, rather than rewriting all existing parameters.

LoRA: Low-Rank Adaptation

LoRA is the most popular PEFT method. It freezes the original model weights and injects small, trainable low-rank matrices.

How LoRA Works

For a pre-trained weight matrix W (shape d × d):

  • Instead of learning a full update ΔW (d × d parameters)
  • LoRA decomposes it as ΔW = A × B, where:
    - A has shape (d × r)
    - B has shape (r × d)
    - r (the rank) is much smaller than d (typically 4-64)

So instead of d² parameters, you learn 2 × d × r parameters.

```
Original:  h = Wx
With LoRA: h = Wx + (A × B)x
             = Wx + ΔWx
```

where W is frozen and only A and B are trained.
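The shapes and the parameter savings can be checked with a small numpy sketch. This is illustrative only: d and r are example values, and B is initialized to zero so that ΔW starts as a no-op (the initialization used in the LoRA paper, adapted to this document's A × B convention):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16                          # hidden size and LoRA rank (example values)

W = rng.standard_normal((d, d))         # frozen pre-trained weight
A = rng.standard_normal((d, r)) * 0.01  # trainable, shape (d, r)
B = np.zeros((r, d))                    # trainable, shape (r, d); zero init => ΔW = 0 at start

x = rng.standard_normal(d)
h = W @ x + (A @ B) @ x                 # LoRA forward pass: h = Wx + ΔWx

# Parameter comparison: full-rank update vs. low-rank update
full_update_params = d * d              # d² = 262,144
lora_params = 2 * d * r                 # 2 * d * r = 16,384
```

Because B starts at zero, the adapted model initially computes exactly the same function as the frozen base model; training then moves A and B away from that starting point.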

LoRA with PEFT Library

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                      # Rank: controls expressiveness vs. efficiency
    lora_alpha=32,             # Scaling factor (often 2*r)
    lora_dropout=0.05,         # Dropout on LoRA layers
    target_modules=[           # Which layers to apply LoRA to
        "q_proj", "v_proj",    # Attention query and value projections
        "k_proj", "o_proj",    # Attention key and output projections
    ],
    bias="none",               # Don't train bias terms
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
```

Choosing LoRA Rank

The rank (r) controls the trade-off between expressiveness and efficiency:

  • r=4: minimal, good for simple tasks
  • r=8-16: a good default that works for most tasks
  • r=32-64: more expressive, for complex domain adaptation

A higher rank means more parameters and more memory, but potentially better performance. Start with r=16 and adjust based on results.
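To estimate the memory impact of a rank choice before training, you can count the LoRA parameters directly. This is a hypothetical helper that assumes every adapted projection is a square d_model × d_model matrix; real models mix projection shapes (and actual counts depend on which modules you target), so treat the numbers as rough estimates:

```python
def lora_param_count(d_model: int, rank: int, n_matrices: int) -> int:
    """Trainable LoRA parameters: each adapted square matrix adds an
    A (d_model x r) and a B (r x d_model), i.e. 2 * d_model * r params."""
    return 2 * d_model * rank * n_matrices

# Example: hidden size 4096, 32 layers x 4 attention projections per layer.
# Doubling the rank doubles the trainable parameter count.
counts = {r: lora_param_count(4096, r, 32 * 4) for r in (4, 8, 16, 32, 64)}
```

This makes the linear trade-off explicit: going from r=16 to r=64 quadruples the adapter size, which is still tiny next to the frozen base model.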

QLoRA: Quantization + LoRA

QLoRA combines 4-bit quantization of the base model with LoRA adapters, dramatically reducing memory usage:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Now a 7B model fits in about 6 GB of VRAM.
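To see why 4-bit storage saves so much, here is a simplified blockwise absmax quantizer in numpy. This illustrates the general idea only: it uses a uniform integer grid, whereas the real NF4 data type uses a non-uniform grid of 16 values matched to a normal distribution, and bitsandbytes additionally packs two 4-bit values per byte and quantizes the scales themselves (the "double quant" above):

```python
import numpy as np

def quantize_4bit(x: np.ndarray, block_size: int = 64):
    """Blockwise absmax quantization to 4-bit signed integers (simplified)."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # one scale per block
    q = np.round(blocks / scales * 7).astype(np.int8)           # values in [-7, 7]
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) / 7) * scales

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s).reshape(-1)
max_err = float(np.abs(w - w_hat).max())  # bounded by the largest block scale / 14
```

Each weight now needs 4 bits plus a shared per-block scale instead of 16 bits, roughly a 4x storage reduction; the LoRA adapters train in fp16 on top of this frozen, quantized base.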

Prefix Tuning

Prefix tuning prepends trainable "virtual tokens" to the input at every layer. These virtual tokens steer the model's behavior without modifying its weights.

```python
from peft import PrefixTuningConfig, get_peft_model

config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Number of prefix tokens
    prefix_projection=True,     # Use an MLP to project prefix embeddings
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
```

Prompt Tuning

Prompt tuning is simpler than prefix tuning: it adds trainable embeddings only at the input layer:

```python
from peft import PromptTuningConfig, get_peft_model, PromptTuningInit

config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,   # Initialize from text
    prompt_tuning_init_text="Classify the sentiment of this text: ",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
)

model = get_peft_model(base_model, config)
```

Comparison of PEFT Methods

| Method | Where Applied | Trainable Params | Best For |
| --- | --- | --- | --- |
| LoRA | Attention weights | ~0.1% | General fine-tuning |
| QLoRA | Attention (4-bit base) | ~0.1% | Memory-constrained setups |
| Prefix Tuning | All layers (virtual tokens) | ~0.1% | Generation tasks |
| Prompt Tuning | Input layer only | ~0.01% | Simple classification |
| Adapter Layers | Inserted between layers | ~1-5% | Multi-task serving |
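The table's last row, adapter layers, is the only method not shown in code above. A minimal numpy sketch of the classic bottleneck adapter (in the style of Houlsby et al.; this is not the PEFT library's implementation, and the dimensions are example values):

```python
import numpy as np

class BottleneckAdapter:
    """Down-project, nonlinearity, up-project, plus a residual connection."""

    def __init__(self, d_model: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.standard_normal((d_model, bottleneck)) * 0.02
        self.W_up = np.zeros((bottleneck, d_model))  # zero init: adapter starts as identity

    def __call__(self, h: np.ndarray) -> np.ndarray:
        z = np.maximum(h @ self.W_down, 0.0)         # ReLU in the bottleneck
        return h + z @ self.W_up                     # residual connection

adapter = BottleneckAdapter(d_model=768, bottleneck=64)
h = np.random.default_rng(1).standard_normal((4, 768))   # (batch, d_model) activations
out = adapter(h)

# Trainable params per adapter: 2 * d_model * bottleneck, inserted after each
# sublayer -- which is why adapters land in the ~1-5% range in the table
```

Unlike LoRA, these modules sit in the forward path at inference time, which is why the table lists them for multi-task serving rather than latency-sensitive single-task deployment.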

Training with LoRA

```python
from transformers import AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset

# Prepare dataset (same as regular fine-tuning)
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

def format_and_tokenize(examples):
    texts = [
        f"### Instruction:\n{inst}\n### Response:\n{out}"
        for inst, out in zip(examples["instruction"], examples["output"])
    ]
    return tokenizer(texts, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(format_and_tokenize, batched=True)

# Training arguments (same as usual, but training is faster with fewer params)
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,          # LoRA can use a higher learning rate
    warmup_steps=50,
    logging_steps=25,
    save_strategy="epoch",
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    tokenizer=tokenizer,
)

trainer.train()
```

Saving and Loading PEFT Models

```python
# Save only the adapter weights (small!)
model.save_pretrained("./my-lora-adapter")
```

The saved directory contains:

- adapter_config.json (the LoRA configuration)
- adapter_model.safetensors (just the LoRA weights, ~16 MB)

Load the adapter on top of the base model:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
```

Merging Adapters

You can merge LoRA weights back into the base model for simplified inference:

```python
# Merge LoRA weights into the base model
merged_model = model.merge_and_unload()

# Now it's a regular model: no adapter overhead during inference
merged_model.save_pretrained("./merged-model")

# Load as a normal model (no PEFT needed)
model = AutoModelForCausalLM.from_pretrained("./merged-model")
```
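Numerically, merging just folds the low-rank update into the frozen weight. A sketch in the document's ΔW = A × B convention (the alpha/r factor is LoRA's standard scaling of the update; shapes are example values):

```python
import numpy as np

d, r, alpha = 512, 16, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((d, r)) * 0.01   # trained LoRA factors
B = rng.standard_normal((r, d)) * 0.01

# Merging folds the scaled low-rank update into the base weight once...
W_merged = W + (alpha / r) * (A @ B)

# ...so inference needs a single matmul instead of base-plus-adapter paths
x = rng.standard_normal(d)
adapter_path = W @ x + (alpha / r) * (A @ (B @ x))
merged_path = W_merged @ x
```

The two paths compute the same output, which is why merging removes the adapter overhead without changing the model's behavior.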

When to Merge

Merge adapters when:

  • Deploying to production (simpler serving, no PEFT dependency)
  • You only need one task-specific model

Keep adapters separate when:

  • Serving multiple tasks from one base model (swap adapters as needed)
  • You want to continue training the adapter later
  • Storage is a concern (adapters are tiny compared to full model copies)