
Parameter-Efficient Fine-Tuning

LoRA, QLoRA, prefix tuning, prompt tuning, adapters, and the PEFT library


Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning updates all parameters in a model. For a 7B parameter model, that means storing and updating 7 billion floats — requiring massive GPU memory and producing a full copy of the model for each task. PEFT methods solve this by training only a small fraction of parameters.

Why PEFT?

| Approach | Trainable Params | GPU Memory | Storage per Task |
| --- | --- | --- | --- |
| Full Fine-Tuning (7B) | 7,000,000,000 | ~28 GB+ | ~14 GB |
| LoRA (rank 16) | ~4,000,000 | ~16 GB | ~16 MB |
| QLoRA (4-bit + LoRA) | ~4,000,000 | ~6 GB | ~16 MB |
PEFT methods train 0.1% or less of the total parameters while achieving performance comparable to full fine-tuning.

The Core Insight of PEFT

Pre-trained models already encode vast knowledge. Fine-tuning only needs to make small adjustments. PEFT methods exploit this by learning a small set of new parameters that modify the model's behavior, rather than rewriting all existing parameters.

LoRA: Low-Rank Adaptation

LoRA is the most popular PEFT method. It freezes the original model weights and injects small, trainable low-rank matrices.

How LoRA Works

For a pre-trained weight matrix W (shape d × d):

  • Instead of learning a full update ΔW (d × d parameters)
  • LoRA decomposes it as ΔW = A × B, where:
    - A has shape (d × r)
    - B has shape (r × d)
    - r (the rank) is much smaller than d (typically 4-64)

So instead of d² parameters, you learn 2 × d × r parameters.

```
Original:  h = Wx
With LoRA: h = Wx + (A × B)x
             = Wx + ΔWx
```

where W is frozen and only A and B are trained.
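The shapes and the parameter savings can be checked with a small numpy sketch. This is illustrative only: d and r are example values, and B is initialized to zero so that ΔW starts as a no-op (the initialization used in the LoRA paper, adapted to this document's A × B convention):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16                          # hidden size and LoRA rank (example values)

W = rng.standard_normal((d, d))         # frozen pre-trained weight
A = rng.standard_normal((d, r)) * 0.01  # trainable, shape (d, r)
B = np.zeros((r, d))                    # trainable, shape (r, d); zero init => ΔW = 0 at start

x = rng.standard_normal(d)
h = W @ x + (A @ B) @ x                 # LoRA forward pass: h = Wx + ΔWx

# Parameter comparison: full-rank update vs. low-rank update
full_update_params = d * d              # d² = 262,144
lora_params = 2 * d * r                 # 2 * d * r = 16,384
```

Because B starts at zero, the adapted model initially computes exactly the same function as the frozen base model; training then moves A and B away from that starting point.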

LoRA with PEFT Library

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                      # Rank: controls expressiveness vs. efficiency
    lora_alpha=32,             # Scaling factor (often 2*r)
    lora_dropout=0.05,         # Dropout on LoRA layers
    target_modules=[           # Which layers to apply LoRA to
        "q_proj", "v_proj",    # Attention query and value projections
        "k_proj", "o_proj",    # Attention key and output projections
    ],
    bias="none",               # Don't train bias terms
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
```

Choosing LoRA Rank

The rank (r) controls the trade-off between expressiveness and efficiency:

  • r=4: minimal, good for simple tasks
  • r=8-16: a good default that works for most tasks
  • r=32-64: more expressive, for complex domain adaptation

A higher rank means more parameters and more memory, but potentially better performance. Start with r=16 and adjust based on results.
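To estimate the memory impact of a rank choice before training, you can count the LoRA parameters directly. This is a hypothetical helper that assumes every adapted projection is a square d_model × d_model matrix; real models mix projection shapes (and actual counts depend on which modules you target), so treat the numbers as rough estimates:

```python
def lora_param_count(d_model: int, rank: int, n_matrices: int) -> int:
    """Trainable LoRA parameters: each adapted square matrix adds an
    A (d_model x r) and a B (r x d_model), i.e. 2 * d_model * r params."""
    return 2 * d_model * rank * n_matrices

# Example: hidden size 4096, 32 layers x 4 attention projections per layer.
# Doubling the rank doubles the trainable parameter count.
counts = {r: lora_param_count(4096, r, 32 * 4) for r in (4, 8, 16, 32, 64)}
```

This makes the linear trade-off explicit: going from r=16 to r=64 quadruples the adapter size, which is still tiny next to the frozen base model.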

QLoRA: Quantization + LoRA

QLoRA combines 4-bit quantization of the base model with LoRA adapters, dramatically reducing memory usage:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Now a 7B model fits in about 6 GB of VRAM.
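To see why 4-bit storage saves so much, here is a simplified blockwise absmax quantizer in numpy. This illustrates the general idea only: it uses a uniform integer grid, whereas the real NF4 data type uses a non-uniform grid of 16 values matched to a normal distribution, and bitsandbytes additionally packs two 4-bit values per byte and quantizes the scales themselves (the "double quant" above):

```python
import numpy as np

def quantize_4bit(x: np.ndarray, block_size: int = 64):
    """Blockwise absmax quantization to 4-bit signed integers (simplified)."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # one scale per block
    q = np.round(blocks / scales * 7).astype(np.int8)           # values in [-7, 7]
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) / 7) * scales

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s).reshape(-1)
max_err = float(np.abs(w - w_hat).max())  # bounded by the largest block scale / 14
```

Each weight now needs 4 bits plus a shared per-block scale instead of 16 bits, roughly a 4x storage reduction; the LoRA adapters train in fp16 on top of this frozen, quantized base.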

Prefix Tuning

Prefix tuning prepends trainable "virtual tokens" to the input at every layer. These virtual tokens steer the model's behavior without modifying its weights.

```python
from peft import PrefixTuningConfig, get_peft_model

config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Number of prefix tokens
    prefix_projection=True,     # Use an MLP to project prefix embeddings
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
```

Prompt Tuning

Prompt tuning is simpler than prefix tuning: it adds trainable embeddings only at the input layer:

```python
from peft import PromptTuningConfig, get_peft_model, PromptTuningInit

config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,   # Initialize from text
    prompt_tuning_init_text="Classify the sentiment of this text: ",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
)

model = get_peft_model(base_model, config)
```

Comparison of PEFT Methods

| Method | Where Applied | Trainable Params | Best For |
| --- | --- | --- | --- |
| LoRA | Attention weights | ~0.1% | General fine-tuning |
| QLoRA | Attention (4-bit base) | ~0.1% | Memory-constrained setups |
| Prefix Tuning | All layers (virtual tokens) | ~0.1% | Generation tasks |
| Prompt Tuning | Input layer only | ~0.01% | Simple classification |
| Adapter Layers | Inserted between layers | ~1-5% | Multi-task serving |
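The table's last row, adapter layers, is the only method not shown in code above. A minimal numpy sketch of the classic bottleneck adapter (in the style of Houlsby et al.; this is not the PEFT library's implementation, and the dimensions are example values):

```python
import numpy as np

class BottleneckAdapter:
    """Down-project, nonlinearity, up-project, plus a residual connection."""

    def __init__(self, d_model: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.standard_normal((d_model, bottleneck)) * 0.02
        self.W_up = np.zeros((bottleneck, d_model))  # zero init: adapter starts as identity

    def __call__(self, h: np.ndarray) -> np.ndarray:
        z = np.maximum(h @ self.W_down, 0.0)         # ReLU in the bottleneck
        return h + z @ self.W_up                     # residual connection

adapter = BottleneckAdapter(d_model=768, bottleneck=64)
h = np.random.default_rng(1).standard_normal((4, 768))   # (batch, d_model) activations
out = adapter(h)

# Trainable params per adapter: 2 * d_model * bottleneck, inserted after each
# sublayer -- which is why adapters land in the ~1-5% range in the table
```

Unlike LoRA, these modules sit in the forward path at inference time, which is why the table lists them for multi-task serving rather than latency-sensitive single-task deployment.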

Training with LoRA

```python
from transformers import AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset

# Prepare dataset (same as regular fine-tuning)
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

def format_and_tokenize(examples):
    texts = [
        f"### Instruction:\n{inst}\n### Response:\n{out}"
        for inst, out in zip(examples["instruction"], examples["output"])
    ]
    return tokenizer(texts, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(format_and_tokenize, batched=True)

# Training arguments (same as usual, but training is faster with fewer params)
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,          # LoRA can use a higher learning rate
    warmup_steps=50,
    logging_steps=25,
    save_strategy="epoch",
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    tokenizer=tokenizer,
)

trainer.train()
```

Saving and Loading PEFT Models

```python
# Save only the adapter weights (small!)
model.save_pretrained("./my-lora-adapter")
```

The saved directory contains:

- adapter_config.json (the LoRA configuration)
- adapter_model.safetensors (just the LoRA weights, ~16 MB)

Load the adapter on top of the base model:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
```

Merging Adapters

You can merge LoRA weights back into the base model for simplified inference:

```python
# Merge LoRA weights into the base model
merged_model = model.merge_and_unload()

# Now it's a regular model: no adapter overhead during inference
merged_model.save_pretrained("./merged-model")

# Load as a normal model (no PEFT needed)
model = AutoModelForCausalLM.from_pretrained("./merged-model")
```
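Numerically, merging just folds the low-rank update into the frozen weight. A sketch in the document's ΔW = A × B convention (the alpha/r factor is LoRA's standard scaling of the update; shapes are example values):

```python
import numpy as np

d, r, alpha = 512, 16, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((d, r)) * 0.01   # trained LoRA factors
B = rng.standard_normal((r, d)) * 0.01

# Merging folds the scaled low-rank update into the base weight once...
W_merged = W + (alpha / r) * (A @ B)

# ...so inference needs a single matmul instead of base-plus-adapter paths
x = rng.standard_normal(d)
adapter_path = W @ x + (alpha / r) * (A @ (B @ x))
merged_path = W_merged @ x
```

The two paths compute the same output, which is why merging removes the adapter overhead without changing the model's behavior.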

When to Merge

Merge adapters when:

  • Deploying to production (simpler serving, no PEFT dependency)
  • You only need one task-specific model

Keep adapters separate when:

  • Serving multiple tasks from one base model (swap adapters as needed)
  • You want to continue training the adapter later
  • Storage is a concern (adapters are tiny compared to full model copies)