Subject 03

Fine-tuning and parameter-efficient tuning

Fine-tuning changes model behavior by training on task-specific examples. Parameter-efficient methods like LoRA reduce memory and compute requirements by training a small set of added weights instead of all parameters.

Beginner

What is fine-tuning?

Fine-tuning takes a pre-trained language model and continues training it on a smaller, task-specific dataset. The model has already learned general language patterns during pre-training; fine-tuning specialises it toward a particular behaviour, domain, or output format. The result is a model that responds more consistently to the target task without needing elaborate prompts every time.

When to fine-tune vs. prompt vs. retrieve

Not every problem requires fine-tuning. A practical decision framework:

  • Prompt first: if a few-shot prompt reaches acceptable quality, it is the cheapest option and the fastest to iterate on.
  • Retrieve (RAG) when the gap is knowledge: missing, private, or fast-changing facts are better fetched at inference time than baked into weights.
  • Fine-tune when the gap is behaviour: consistent style, output format, or domain-specific responses that prompts cannot reliably enforce, or when long prompts become a cost bottleneck at scale.

These approaches are not mutually exclusive. A fine-tuned model can still use RAG for grounding and prompts for per-request customisation.

Types of fine-tuning

Data preparation basics

The quality and format of your training data matters more than the quantity. Key principles:

  • Match the training format exactly to the format you will use at inference (same chat roles, same system-prompt style).
  • Label consistently — conflicting labels for similar inputs actively confuse the model.
  • Deduplicate, and hold out a validation split before training starts.
  • For narrow tasks, a few hundred clean examples often beat thousands of noisy ones.

Example: a chat-format training sample for a legal document classifier:

training_example = {
    "messages": [
        {"role": "system", "content": "You are a legal document classifier. Respond with only the document type code."},
        {"role": "user", "content": "Classify: NDA between startup and contractor regarding IP rights"},
        {"role": "assistant", "content": "contract_type: nda"}
    ]
}

# Most fine-tuning APIs expect JSONL β€” one JSON object per line
import json
with open("train.jsonl", "w") as f:
    f.write(json.dumps(training_example) + "\n")
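Before uploading, it is worth validating the file programmatically. A minimal sketch (the checks are illustrative; adjust them to your provider's exact schema):

```python
import json

def validate_jsonl(path):
    """Collect basic format errors in a JSONL chat-format training file."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: invalid JSON")
                continue
            roles = [m.get("role") for m in record.get("messages", [])]
            if not roles or roles[-1] != "assistant":
                errors.append(f"line {i}: last message must be from the assistant")
    return errors
```

An empty return list means the file passed these checks; anything else is worth fixing before spending compute on training.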

Key terminology

Advanced

Full fine-tuning in depth

Full fine-tuning updates all model parameters via backpropagation on your task-specific data. For a 7B parameter model in fp16, this requires roughly 14 GB just for model weights, plus optimizer states (Adam stores two extra copies per parameter), gradients, and activations β€” easily exceeding 80 GB of GPU memory. For 70B+ models, multi-GPU setups with model parallelism become mandatory.
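The arithmetic behind those numbers can be sketched as a back-of-envelope estimate (activation memory, which depends on batch size and sequence length, is deliberately excluded, so real totals are higher):

```python
def full_ft_memory_gb(n_params: float) -> dict:
    """Rough memory estimate for mixed-precision full fine-tuning with Adam."""
    gb = 1e9
    weights_fp16 = 2 * n_params / gb   # model weights, 2 bytes each
    grads_fp16   = 2 * n_params / gb   # gradients, same size as weights
    master_fp32  = 4 * n_params / gb   # fp32 master copy of the weights
    adam_states  = 8 * n_params / gb   # Adam's m and v, fp32 each
    total = weights_fp16 + grads_fp16 + master_fp32 + adam_states
    return {"weights": weights_fp16, "total_no_activations": total}

est = full_ft_memory_gb(7e9)
print(f"{est['weights']:.0f} GB weights, {est['total_no_activations']:.0f} GB before activations")
# → 14 GB weights, 112 GB before activations
```

Sixteen bytes of training state per parameter is why a 7B model that serves comfortably on one GPU needs a multi-GPU cluster to fully fine-tune.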

Because every weight can change, full fine-tuning offers maximum expressiveness but also maximum risk of catastrophic forgetting and overfitting.

LoRA β€” Low-Rank Adaptation

LoRA (Hu et al., 2021) is the most widely used PEFT method. The core insight: weight updates during fine-tuning have low intrinsic rank, so they can be decomposed into two small matrices instead of modifying the full weight matrix.

Original weight matrix W ∈ ℝ^(dΓ—k)

During fine-tuning, the update Ξ”W is decomposed:
  Ξ”W = B Γ— A    where  B ∈ ℝ^(dΓ—r)  and  A ∈ ℝ^(rΓ—k)

r β‰ͺ min(d, k)   (typical r = 4, 8, 16, 32, 64)

Forward pass:  h = WΒ·x + (BΒ·A)Β·x Β· (Ξ±/r)

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  W (frozen)  │──────┐
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
       x ───           β”œβ”€β”€β–Ί h = Wx + BAxΒ·(Ξ±/r)
           β”‚           β”‚
  β”Œβ”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”    β”‚
  β”‚ A  β”‚β†’β”‚ B  β”‚β”€β”€β”€β”€β”˜
  β”‚(rΓ—k)β”‚  β”‚(dΓ—r)β”‚  (trainable)
  β””β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”˜

Key LoRA hyperparameters:

  • r (rank): capacity of the update; higher r captures more task-specific change at the cost of more trainable parameters.
  • lora_alpha: scaling factor — the update is multiplied by α/r, and α is commonly set to 2×r.
  • target_modules: which weight matrices receive adapters; the attention projections (q_proj, k_proj, v_proj, o_proj) are the usual choice.
  • lora_dropout: dropout on the adapter path, for regularisation.

Advantage: at inference time, LoRA matrices can be merged back into the base weights (W' = W + BAΒ·Ξ±/r), adding zero latency. Multiple LoRA adapters can be swapped onto the same base model for different tasks.

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # rank of update matrices
    lora_alpha=32,                 # scaling factor (alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# β†’ trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.062%

QLoRA β€” Quantized LoRA

QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization to enable fine-tuning of large models on consumer GPUs. Key innovations:

  • 4-bit NormalFloat (NF4): a quantization data type suited to the normally distributed weights of pre-trained models.
  • Double quantization: the quantization constants are themselves quantized, saving a further fraction of a bit per parameter.
  • Paged optimizers: optimizer states spill to CPU memory to survive memory spikes.
  • Gradients flow through the frozen 4-bit base weights into bf16 LoRA adapters, which remain the only trainable parameters.

With QLoRA, a 65B parameter model can be fine-tuned on a single 48 GB GPU while matching the performance of full 16-bit fine-tuning.
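A rough memory sketch shows why this works (the 0.1% trainable fraction is an assumption in the typical LoRA range; activations and quantization-constant overhead are excluded):

```python
n_params = 65e9

# Frozen base model at 4 bits = 0.5 bytes per parameter
base_4bit_gb = 0.5 * n_params / 1e9

# Trainable LoRA side: bf16 weights + grads (2+2 B) and fp32 Adam m,v (8 B)
lora_params = 0.001 * n_params          # assume ~0.1% of parameters trainable
lora_gb = (2 + 2 + 8) * lora_params / 1e9

print(f"base: {base_4bit_gb:.1f} GB, LoRA training state: {lora_gb:.2f} GB")
# → base: 32.5 GB, LoRA training state: 0.78 GB
```

The 4-bit base dominates at ~32.5 GB, leaving headroom on a 48 GB card for activations and paging — whereas the same model in fp16 would need 130 GB for the weights alone.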

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # double quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", lora_dropout=0.05)
model = get_peft_model(model, lora_config)

Other PEFT methods

While LoRA dominates in practice, several other parameter-efficient methods exist:

  • Adapter layers (Houlsby et al., 2019): small bottleneck MLPs inserted between transformer sublayers.
  • Prefix tuning (Li & Liang, 2021): trainable key/value vectors prepended to attention at every layer.
  • Prompt tuning (Lester et al., 2021): trainable soft-prompt embeddings prepended only to the input.
  • (IA)³ (Liu et al., 2022): learned vectors that rescale keys, values, and feed-forward activations.

PEFT method comparison

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Method           β”‚ Trainable %  β”‚ Memory     β”‚ Inf. latency β”‚ Merge into base β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Full fine-tuning β”‚ 100%         β”‚ Very high  β”‚ None         β”‚ N/A             β”‚
β”‚ LoRA             β”‚ 0.01–0.1%   β”‚ Low        β”‚ None (merge) β”‚ Yes             β”‚
β”‚ QLoRA            β”‚ 0.01–0.1%   β”‚ Very low   β”‚ None (merge) β”‚ Yes             β”‚
β”‚ Adapter layers   β”‚ 1–5%        β”‚ Low        β”‚ Small add    β”‚ No              β”‚
β”‚ Prefix tuning    β”‚ 0.01–1%     β”‚ Low        β”‚ Small add    β”‚ No              β”‚
β”‚ Prompt tuning    β”‚ < 0.01%     β”‚ Very low   β”‚ None         β”‚ No              β”‚
β”‚ (IA)Β³            β”‚ < 0.01%     β”‚ Very low   β”‚ Negligible   β”‚ Yes             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Training best practices

from transformers import TrainingArguments, EarlyStoppingCallback
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size = 16
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    bf16=True,
    load_best_model_at_end=True,      # reload the best checkpoint at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop when eval loss stalls
)

Data formatting patterns

Different tasks require different data structures:

# Classification (chat format)
{"messages": [
    {"role": "system", "content": "Classify the support ticket into one of: billing, technical, account, other."},
    {"role": "user", "content": "I can't log in after resetting my password"},
    {"role": "assistant", "content": "account"}
]}

# Extraction (structured output)
{"messages": [
    {"role": "user", "content": "Extract entities: 'John Smith signed a 3-year lease at 42 Oak Street on March 1'"},
    {"role": "assistant", "content": "{\"person\": \"John Smith\", \"duration\": \"3 years\", \"address\": \"42 Oak Street\", \"date\": \"March 1\"}"}
]}

# Instruction following (style adaptation)
{"messages": [
    {"role": "user", "content": "Summarise this medical report in plain language for the patient."},
    {"role": "assistant", "content": "Your blood test results are normal. Your cholesterol is slightly elevated..."}
]}

Evaluation strategy

Evaluation must go beyond training loss to determine whether fine-tuning was worthwhile:

  • Task metrics on a held-out set (accuracy, F1, exact match), compared against the prompted baseline — not in isolation.
  • Regression tests on general capabilities, to detect catastrophic forgetting.
  • Error analysis: categorise the remaining failures rather than reporting a single number.
  • Cost accounting: does the tuned model actually reduce prompt length, latency, or spend in production?
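A minimal harness for the baseline-vs-tuned comparison (a sketch: `classify_baseline` and `classify_tuned` are hypothetical callables wrapping your own inference code):

```python
def accuracy(classify, examples):
    """Fraction of (text, label) pairs the classifier gets right."""
    hits = sum(1 for text, label in examples if classify(text) == label)
    return hits / len(examples)

heldout = [
    ("I can't log in after resetting my password", "account"),
    ("My card was charged twice this month", "billing"),
]

# acc_base  = accuracy(classify_baseline, heldout)
# acc_tuned = accuracy(classify_tuned, heldout)
# Report both side by side, alongside regression checks on general prompts.
```

Keeping the harness model-agnostic makes it trivial to add RAG-augmented or re-prompted variants to the same comparison later.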

Deployment considerations

from transformers import AutoTokenizer

# Merge LoRA weights into the base model and save for deployment
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("./merged-model")

# Or save the adapter separately (10–100 MB vs the full model)
peft_model.save_pretrained("./lora-adapter")
# Load later:
# from peft import PeftModel
# model = AutoModelForCausalLM.from_pretrained("base-model")
# model = PeftModel.from_pretrained(model, "./lora-adapter")

To-do list

Learn

  • Understand the difference between full fine-tuning, instruction tuning, and PEFT.
  • Learn what LoRA changes mathematically β€” low-rank decomposition of weight updates.
  • Study how QLoRA combines 4-bit quantization with LoRA to reduce memory requirements.
  • Compare PEFT methods: LoRA, adapter layers, prefix tuning, prompt tuning, (IA)Β³.
  • Study common data formatting styles for chat, classification, and extraction tasks.
  • Learn why label quality matters more than dataset size in many narrow tasks.
  • Understand catastrophic forgetting and how PEFT methods mitigate it.
  • Learn the role of rank (r), lora_alpha, and target_modules in LoRA configuration.

Practice

  • Prepare a small dataset of 200–500 examples in JSONL chat format for a classification task.
  • Fine-tune a small model (e.g., 7B) using LoRA with the HuggingFace PEFT library.
  • Try QLoRA to fine-tune a 13B+ model on a single consumer GPU.
  • Compare prompt-only performance against a parameter-efficient fine-tuned model on your task.
  • Experiment with different rank values (4, 16, 64) and measure quality vs. training cost.
  • Measure whether fine-tuning reduces prompt length and output variability.
  • Inspect examples where the tuned model still fails β€” categorise failure modes.
  • Run regression tests to check that general knowledge has not degraded.

Build

  • Train a LoRA adapter for an internal text classification or extraction task.
  • Create an evaluation report comparing baseline, prompted, RAG-augmented, and tuned variants.
  • Package the adapter with metadata: base model, dataset version, hyperparameters, metrics, and known risks.
  • Set up a multi-adapter serving pipeline β€” one base model with swappable LoRA adapters per task.
  • Build a data curation pipeline with quality checks, deduplication, and format validation.
  • Deploy a demo endpoint using vLLM or TGI that loads the adapter at inference time.