Subject 03

Fine-tuning and parameter-efficient tuning

Fine-tuning changes model behavior by training on task-specific examples. Parameter-efficient methods like LoRA reduce memory and compute requirements by training a small set of added weights instead of all parameters.

Beginner

What is fine-tuning?

Fine-tuning takes a pre-trained language model and continues training it on a smaller, task-specific dataset. The model has already learned general language patterns during pre-training; fine-tuning specialises it toward a particular behaviour, domain, or output format. The result is a model that responds more consistently to the target task without needing elaborate prompts every time.

When to fine-tune vs. prompt vs. retrieve

Not every problem requires fine-tuning. A practical decision framework:

  • Prompt first: if a few-shot prompt reaches acceptable quality, it is the cheapest option and the fastest to iterate on.
  • Retrieve (RAG) when the gap is knowledge: missing, private, or fast-changing facts are better fetched at inference time than baked into weights.
  • Fine-tune when the gap is behaviour: consistent style, output format, or domain-specific responses that prompts cannot reliably enforce, or when long prompts become a cost bottleneck at scale.

These approaches are not mutually exclusive. A fine-tuned model can still use RAG for grounding and prompts for per-request customisation.

Types of fine-tuning

Data preparation basics

The quality and format of your training data matters more than the quantity. Key principles:

  • Match the training format exactly to the format you will use at inference (same chat roles, same system-prompt style).
  • Label consistently — conflicting labels for similar inputs actively confuse the model.
  • Deduplicate, and hold out a validation split before training starts.
  • For narrow tasks, a few hundred clean examples often beat thousands of noisy ones.

Example: a chat-format training sample for a legal document classifier:

training_example = {
    "messages": [
        {"role": "system", "content": "You are a legal document classifier. Respond with only the document type code."},
        {"role": "user", "content": "Classify: NDA between startup and contractor regarding IP rights"},
        {"role": "assistant", "content": "contract_type: nda"}
    ]
}

# Most fine-tuning APIs expect JSONL β€” one JSON object per line
import json
with open("train.jsonl", "w") as f:
    f.write(json.dumps(training_example) + "\n")
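Before uploading, it is worth validating the file programmatically. A minimal sketch (the checks are illustrative; adjust them to your provider's exact schema):

```python
import json

def validate_jsonl(path):
    """Collect basic format errors in a JSONL chat-format training file."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: invalid JSON")
                continue
            roles = [m.get("role") for m in record.get("messages", [])]
            if not roles or roles[-1] != "assistant":
                errors.append(f"line {i}: last message must be from the assistant")
    return errors
```

An empty return list means the file passed these checks; anything else is worth fixing before spending compute on training.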

Key terminology

Advanced

Full fine-tuning in depth

Full fine-tuning updates all model parameters via backpropagation on your task-specific data. For a 7B parameter model in fp16, this requires roughly 14 GB just for model weights, plus optimizer states (Adam stores two extra copies per parameter), gradients, and activations β€” easily exceeding 80 GB of GPU memory. For 70B+ models, multi-GPU setups with model parallelism become mandatory.
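The arithmetic behind those numbers can be sketched as a back-of-envelope estimate (activation memory, which depends on batch size and sequence length, is deliberately excluded, so real totals are higher):

```python
def full_ft_memory_gb(n_params: float) -> dict:
    """Rough memory estimate for mixed-precision full fine-tuning with Adam."""
    gb = 1e9
    weights_fp16 = 2 * n_params / gb   # model weights, 2 bytes each
    grads_fp16   = 2 * n_params / gb   # gradients, same size as weights
    master_fp32  = 4 * n_params / gb   # fp32 master copy of the weights
    adam_states  = 8 * n_params / gb   # Adam's m and v, fp32 each
    total = weights_fp16 + grads_fp16 + master_fp32 + adam_states
    return {"weights": weights_fp16, "total_no_activations": total}

est = full_ft_memory_gb(7e9)
print(f"{est['weights']:.0f} GB weights, {est['total_no_activations']:.0f} GB before activations")
# → 14 GB weights, 112 GB before activations
```

Sixteen bytes of training state per parameter is why a 7B model that serves comfortably on one GPU needs a multi-GPU cluster to fully fine-tune.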

Because every weight can change, full fine-tuning offers maximum expressiveness but also maximum risk of catastrophic forgetting and overfitting.

LoRA β€” Low-Rank Adaptation

LoRA (Hu et al., 2021) is the most widely used PEFT method. The core insight: weight updates during fine-tuning have low intrinsic rank, so they can be decomposed into two small matrices instead of modifying the full weight matrix.

Original weight matrix W ∈ ℝ^(dΓ—k)

During fine-tuning, the update Ξ”W is decomposed:
  Ξ”W = B Γ— A    where  B ∈ ℝ^(dΓ—r)  and  A ∈ ℝ^(rΓ—k)

r β‰ͺ min(d, k)   (typical r = 4, 8, 16, 32, 64)

Forward pass:  h = WΒ·x + (BΒ·A)Β·x Β· (Ξ±/r)

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  W (frozen)  │──────┐
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
       x ───           β”œβ”€β”€β–Ί h = Wx + BAxΒ·(Ξ±/r)
           β”‚           β”‚
  β”Œβ”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”    β”‚
  β”‚ A  β”‚β†’β”‚ B  β”‚β”€β”€β”€β”€β”˜
  β”‚(rΓ—k)β”‚  β”‚(dΓ—r)β”‚  (trainable)
  β””β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”˜

Key LoRA hyperparameters:

  • r (rank): capacity of the update; higher r captures more task-specific change at the cost of more trainable parameters.
  • lora_alpha: scaling factor — the update is multiplied by α/r, and α is commonly set to 2×r.
  • target_modules: which weight matrices receive adapters; the attention projections (q_proj, k_proj, v_proj, o_proj) are the usual choice.
  • lora_dropout: dropout on the adapter path, for regularisation.

Advantage: at inference time, LoRA matrices can be merged back into the base weights (W' = W + BAΒ·Ξ±/r), adding zero latency. Multiple LoRA adapters can be swapped onto the same base model for different tasks.

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # rank of update matrices
    lora_alpha=32,                 # scaling factor (alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# β†’ trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.062%

QLoRA β€” Quantized LoRA

QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization to enable fine-tuning of large models on consumer GPUs. Key innovations:

  • 4-bit NormalFloat (NF4): a quantization data type suited to the normally distributed weights of pre-trained models.
  • Double quantization: the quantization constants are themselves quantized, saving a further fraction of a bit per parameter.
  • Paged optimizers: optimizer states spill to CPU memory to survive memory spikes.
  • Gradients flow through the frozen 4-bit base weights into bf16 LoRA adapters, which remain the only trainable parameters.

With QLoRA, a 65B parameter model can be fine-tuned on a single 48 GB GPU while matching the performance of full 16-bit fine-tuning.
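A rough memory sketch shows why this works (the 0.1% trainable fraction is an assumption in the typical LoRA range; activations and quantization-constant overhead are excluded):

```python
n_params = 65e9

# Frozen base model at 4 bits = 0.5 bytes per parameter
base_4bit_gb = 0.5 * n_params / 1e9

# Trainable LoRA side: bf16 weights + grads (2+2 B) and fp32 Adam m,v (8 B)
lora_params = 0.001 * n_params          # assume ~0.1% of parameters trainable
lora_gb = (2 + 2 + 8) * lora_params / 1e9

print(f"base: {base_4bit_gb:.1f} GB, LoRA training state: {lora_gb:.2f} GB")
# → base: 32.5 GB, LoRA training state: 0.78 GB
```

The 4-bit base dominates at ~32.5 GB, leaving headroom on a 48 GB card for activations and paging — whereas the same model in fp16 would need 130 GB for the weights alone.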

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # double quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", lora_dropout=0.05)
model = get_peft_model(model, lora_config)

Other PEFT methods

While LoRA dominates in practice, several other parameter-efficient methods exist:

  • Adapter layers (Houlsby et al., 2019): small bottleneck MLPs inserted between transformer sublayers.
  • Prefix tuning (Li & Liang, 2021): trainable key/value vectors prepended to attention at every layer.
  • Prompt tuning (Lester et al., 2021): trainable soft-prompt embeddings prepended only to the input.
  • (IA)³ (Liu et al., 2022): learned vectors that rescale keys, values, and feed-forward activations.

PEFT method comparison

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Method           β”‚ Trainable %  β”‚ Memory     β”‚ Inf. latency β”‚ Merge into base β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Full fine-tuning β”‚ 100%         β”‚ Very high  β”‚ None         β”‚ N/A             β”‚
β”‚ LoRA             β”‚ 0.01–0.1%   β”‚ Low        β”‚ None (merge) β”‚ Yes             β”‚
β”‚ QLoRA            β”‚ 0.01–0.1%   β”‚ Very low   β”‚ None (merge) β”‚ Yes             β”‚
β”‚ Adapter layers   β”‚ 1–5%        β”‚ Low        β”‚ Small add    β”‚ No              β”‚
β”‚ Prefix tuning    β”‚ 0.01–1%     β”‚ Low        β”‚ Small add    β”‚ No              β”‚
β”‚ Prompt tuning    β”‚ < 0.01%     β”‚ Very low   β”‚ None         β”‚ No              β”‚
β”‚ (IA)Β³            β”‚ < 0.01%     β”‚ Very low   β”‚ Negligible   β”‚ Yes             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Training best practices

from transformers import TrainingArguments, EarlyStoppingCallback
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size = 16
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    bf16=True,
    load_best_model_at_end=True,      # reload the best checkpoint at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop when eval loss stalls
)

Data formatting patterns

Different tasks require different data structures:

# Classification (chat format)
{"messages": [
    {"role": "system", "content": "Classify the support ticket into one of: billing, technical, account, other."},
    {"role": "user", "content": "I can't log in after resetting my password"},
    {"role": "assistant", "content": "account"}
]}

# Extraction (structured output)
{"messages": [
    {"role": "user", "content": "Extract entities: 'John Smith signed a 3-year lease at 42 Oak Street on March 1'"},
    {"role": "assistant", "content": "{\"person\": \"John Smith\", \"duration\": \"3 years\", \"address\": \"42 Oak Street\", \"date\": \"March 1\"}"}
]}

# Instruction following (style adaptation)
{"messages": [
    {"role": "user", "content": "Summarise this medical report in plain language for the patient."},
    {"role": "assistant", "content": "Your blood test results are normal. Your cholesterol is slightly elevated..."}
]}

Evaluation strategy

Evaluation must go beyond training loss to determine whether fine-tuning was worthwhile:

  • Task metrics on a held-out set (accuracy, F1, exact match), compared against the prompted baseline — not in isolation.
  • Regression tests on general capabilities, to detect catastrophic forgetting.
  • Error analysis: categorise the remaining failures rather than reporting a single number.
  • Cost accounting: does the tuned model actually reduce prompt length, latency, or spend in production?
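A minimal harness for the baseline-vs-tuned comparison (a sketch: `classify_baseline` and `classify_tuned` are hypothetical callables wrapping your own inference code):

```python
def accuracy(classify, examples):
    """Fraction of (text, label) pairs the classifier gets right."""
    hits = sum(1 for text, label in examples if classify(text) == label)
    return hits / len(examples)

heldout = [
    ("I can't log in after resetting my password", "account"),
    ("My card was charged twice this month", "billing"),
]

# acc_base  = accuracy(classify_baseline, heldout)
# acc_tuned = accuracy(classify_tuned, heldout)
# Report both side by side, alongside regression checks on general prompts.
```

Keeping the harness model-agnostic makes it trivial to add RAG-augmented or re-prompted variants to the same comparison later.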

Deployment considerations

from transformers import AutoTokenizer

# Merge LoRA weights into the base model and save for deployment
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("./merged-model")

# Or save the adapter separately (10–100 MB vs the full model)
peft_model.save_pretrained("./lora-adapter")
# Load later:
# from peft import PeftModel
# model = AutoModelForCausalLM.from_pretrained("base-model")
# model = PeftModel.from_pretrained(model, "./lora-adapter")

To-do list

Learn

  • Understand the difference between full fine-tuning, instruction tuning, and PEFT.
  • Learn what LoRA changes mathematically β€” low-rank decomposition of weight updates.
  • Study how QLoRA combines 4-bit quantization with LoRA to reduce memory requirements.
  • Compare PEFT methods: LoRA, adapter layers, prefix tuning, prompt tuning, (IA)Β³.
  • Study common data formatting styles for chat, classification, and extraction tasks.
  • Learn why label quality matters more than dataset size in many narrow tasks.
  • Understand catastrophic forgetting and how PEFT methods mitigate it.
  • Learn the role of rank (r), lora_alpha, and target_modules in LoRA configuration.

Practice

  • Prepare a small dataset of 200–500 examples in JSONL chat format for a classification task.
  • Fine-tune a small model (e.g., 7B) using LoRA with the HuggingFace PEFT library.
  • Try QLoRA to fine-tune a 13B+ model on a single consumer GPU.
  • Compare prompt-only performance against a parameter-efficient fine-tuned model on your task.
  • Experiment with different rank values (4, 16, 64) and measure quality vs. training cost.
  • Measure whether fine-tuning reduces prompt length and output variability.
  • Inspect examples where the tuned model still fails β€” categorise failure modes.
  • Run regression tests to check that general knowledge has not degraded.

Build

  • Train a LoRA adapter for an internal text classification or extraction task.
  • Create an evaluation report comparing baseline, prompted, RAG-augmented, and tuned variants.
  • Package the adapter with metadata: base model, dataset version, hyperparameters, metrics, and known risks.
  • Set up a multi-adapter serving pipeline β€” one base model with swappable LoRA adapters per task.
  • Build a data curation pipeline with quality checks, deduplication, and format validation.
  • Deploy a demo endpoint using vLLM or TGI that loads the adapter at inference time.