Subject 04

Quantization and model optimization

Model optimization is the practice of making inference cheaper and faster while preserving enough quality for the task. Quantization is one of the most important tools in that toolbox.

Beginner

What is quantization?

Quantization means representing model weights (and sometimes activations) with fewer bits. Instead of storing every number as a 32-bit or 16-bit floating-point value, you compress them down to 8-bit integers, 4-bit integers, or even lower. The model becomes smaller, uses less memory, and often runs faster—at the cost of a small accuracy trade-off.

Simple intuition: imagine a photograph saved at ultra-high resolution versus a compressed JPEG. The JPEG is much smaller and looks almost the same. Quantization does something similar for neural network numbers.
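The core mechanics fit in a few lines of plain Python. This is a toy symmetric, per-tensor scheme (real libraries quantize per-channel or per-group, but the round-trip idea is the same):

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto the integer grid [-127, 127]."""
    scale = max(abs(v) for v in values) / 127  # one float scale for the whole tensor
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    """Recover approximate floats: each stored int is a multiple of the scale."""
    return [q * scale for q in quants]

weights = [0.02, -1.37, 0.85, 2.54, -0.61]
quants, scale = quantize_int8(weights)
restored = dequantize(quants, scale)
# Round-trip error per element is bounded by scale/2 = max|w| / 254
```

The model stores only the int8 values plus one float scale, which is where the 4x saving over FP32 comes from.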

Numeric data types at a glance

Understanding the common data types is fundamental:

Bit Layout Comparison:

FP32:  [S][EEEEEEEE][MMMMMMMMMMMMMMMMMMMMMMM]   32 bits
        1     8                23                Sign + 8-bit exponent + 23-bit mantissa

BF16:  [S][EEEEEEEE][MMMMMMM]                    16 bits
        1     8          7                        Same exponent range as FP32 (no overflow)

FP16:  [S][EEEEE][MMMMMMMMMM]                    16 bits
        1    5        10                          More precision than BF16, but limited range

FP8:   [S][EEEE][MMM]  (E4M3)                     8 bits
        1    4     3                              Good for inference (more precision)
       [S][EEEEE][MM]  (E5M2)                     8 bits
        1    5     2                              Good for gradients (wider range)

INT8:  [BBBBBBBB]                                  8 bits = 256 discrete values
INT4:  [BBBB]                                      4 bits =  16 discrete values
NF4:   [BBBB]  (mapped to normal distribution)     4 bits =  16 optimal quantile values
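The range-vs-precision trade-off above can be checked with Python's `struct` module. This sketch emulates BF16 by truncating the low 16 mantissa bits of a float32 (the simplest rounding mode; real hardware rounds to nearest), and round-trips FP16 through the IEEE half-precision `'e'` format:

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate BF16 by zeroing the low 16 bits of the float32 bit pattern."""
    bits, = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision ('e': 5-bit exponent, 10-bit mantissa)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_bf16(3.14159))   # 3.140625 -> only ~2-3 decimal digits survive
print(to_bf16(70000.0))   # 69632.0  -> large values survive (FP32 exponent range)
try:
    to_fp16(70000.0)      # FP16 maxes out at 65504
except OverflowError:
    print("FP16 overflow")
```

This is exactly why BF16 is preferred for training: coarse precision, but the same dynamic range as FP32.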
				

Why does it matter?

Memory footprint formula

def estimate_memory_gb(num_params, bits_per_param):
    """Estimate GPU memory for model weights only (excludes KV cache, activations, overhead)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / (1024 ** 3)

# Example: 7B model at various precisions
for bits in [32, 16, 8, 4]:
    gb = estimate_memory_gb(7_000_000_000, bits)
    print(f"  {bits:>2}-bit: {gb:.1f} GB")

Memory per model (weights only, approximate)

Model Size β”‚ FP32   β”‚ FP16/BF16 β”‚ INT8  β”‚ INT4
───────────┼────────┼───────────┼───────┼──────
    3B     β”‚ 12 GB  β”‚   6 GB    β”‚ 3 GB  β”‚ 1.5 GB
    7B     β”‚ 28 GB  β”‚  14 GB    β”‚ 7 GB  β”‚ 3.5 GB
   13B     β”‚ 52 GB  β”‚  26 GB    β”‚ 13 GB β”‚ 6.5 GB
   70B     β”‚ 280 GB β”‚ 140 GB    β”‚ 70 GB β”‚ 35 GB

Remember: Actual GPU memory usage is higher than weight-only estimates. You must also account for KV cache (scales with sequence length and batch size), activation memory, CUDA context (~500 MB–1 GB), and framework overhead.

Intermediate

Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT)

There are two fundamental approaches to quantizing a model:

  • Post-Training Quantization (PTQ): quantize a fully trained model after the fact, usually with a small calibration dataset and no retraining.
  • Quantization-Aware Training (QAT): insert simulated ("fake") quantization ops during training or fine-tuning so the model learns to compensate for the reduced precision.

Exam tip: PTQ is faster and cheaper; QAT gives better quality at extreme compression. For most LLM deployment, PTQ at 4-bit or 8-bit is the standard practice.

The outlier problem

One of the most important concepts in LLM quantization: large language models develop a small number of activation channels with values 10–100× larger than the rest. These outlier features (also called “emergent features”) appear in models above ~6B parameters and concentrate in specific hidden dimensions.

The Outlier Problem Visualized:

Activation values before quantization:
β”‚                                                             β–Œ outlier: 85.0
β”‚                                                             β–Œ
β”‚                                                             β–Œ
β”‚                                                             β–Œ
β”‚  β–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œβ–Œ  normal: -2.0 to 3.0
└──────────────────────────────────────────────────────────────
  Channel 1  2  3  4  5 ... 95 96 97 98 99  ← ~1% are outliers

With naive INT8 (256 values for range -85 to +85):
  β€’ Each step β‰ˆ 0.67  β†’  values between -2 and 3 get only ~7 distinct levels
  β€’ Massive precision loss for 99% of channels!

With LLM.int8() or SmoothQuant:
  β€’ Outliers handled separately (FP16 or smoothed)
  β€’ Normal channels get full INT8 precision in their range
				

Weight-only vs. weight-and-activation quantization

Weight-only methods quantize just the weight matrices and dequantize them on the fly inside each matmul, which mainly saves memory and bandwidth. Weight-and-activation (W+A) methods quantize both operands so the matmul itself runs on low-precision integer units (e.g., INT8 tensor cores), but they must handle activation outliers carefully.

Popular quantization methods compared

Method           │ Type        │ Bits │ Calibration   │ Speed     │ Best For
─────────────────┼─────────────┼──────┼───────────────┼───────────┼───────────────────────────────────
GPTQ             │ Weight-only │ 2-8  │ Required      │ Fast gen  │ GPU deployment, text gen
AWQ              │ Weight-only │ 4    │ Required      │ Fast gen  │ GPU deployment, preserves quality
bitsandbytes     │ Weight-only │ 4/8  │ None (0-shot) │ Moderate  │ Easy setup, QLoRA fine-tuning
GGUF (llama.cpp) │ Weight-only │ 2-8  │ None          │ CPU+GPU   │ Local/edge, CPU inference
HQQ              │ Weight-only │ 1-8  │ None (0-shot) │ Fast      │ Quick quantization, no calibration
AQLM             │ Weight-only │ 1-2  │ Required      │ Moderate  │ Extreme 1-2 bit compression
SmoothQuant      │ W+A         │ 8    │ Required      │ Very fast │ INT8 GPU with tensor cores

GPTQ in depth

GPTQ (Generative Pre-trained Transformer Quantization) uses a one-shot weight quantization algorithm based on approximate second-order information (the Hessian). It processes weights column-by-column and compensates for quantization error by adjusting remaining weights.

# Quantize a model with GPTQ using AutoGPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration dataset (small sample of representative text)
calibration_texts = ["Example calibration text..."]  # in practice: a few hundred representative samples

gptq_config = GPTQConfig(
    bits=4,                     # 4-bit quantization
    group_size=128,             # weights grouped in blocks of 128
    dataset=calibration_texts,
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)

AWQ (Activation-Aware Weight Quantization)

AWQ observes that not all weights are equally important. By analyzing activation distributions, it identifies the ~1% of “salient” weight channels that matter most and applies a mathematical scaling transformation to protect them during quantization.
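The scaling trick can be illustrated with a toy example (hypothetical numbers; real AWQ searches per-channel scales that minimize output error). Multiplying a salient weight channel by s and dividing the matching activation by s leaves the full-precision product unchanged, but moves the weight into a region the quantization grid resolves better:

```python
def fake_quant(v, step):
    """Quantize-dequantize onto a symmetric grid with the given step size."""
    return round(v / step) * step

x = [100.0, 1.0]   # channel 0 has huge activations => "salient"
w = [0.01, 1.0]    # ...but a tiny weight that naive 4-bit rounds to zero
exact = x[0] * w[0] + x[1] * w[1]   # = 2.0

step = max(abs(v) for v in w) / 7   # 4-bit symmetric grid: levels -7..7
naive = sum(xi * fake_quant(wi, step) for xi, wi in zip(x, w))

s = 12.0                            # AWQ-style scale on the salient channel
x2, w2 = [x[0] / s, x[1]], [w[0] * s, w[1]]
step2 = max(abs(v) for v in w2) / 7
scaled = sum(xi * fake_quant(wi, step2) for xi, wi in zip(x2, w2))
# naive loses the whole salient contribution; the scaled version recovers most of it
```

The scale factors fold into the previous layer's weights, so inference pays no extra cost.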

bitsandbytes (zero-shot quantization)

bitsandbytes is the easiest quantization path—it requires no calibration data. Any model with torch.nn.Linear layers can be quantized instantly on load.

# Load a model in 4-bit with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4
    bnb_4bit_compute_dtype="bfloat16",  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,     # double quantization saves ~0.4 bits/param
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

GGUF and llama.cpp

GGUF is a file format designed for running quantized models on CPUs (and optionally offloading layers to GPU). It powers the llama.cpp ecosystem and tools like Ollama, LM Studio, and GPT4All.

Naming convention: Q4_K_M means 4-bit k-quant medium. Lower numbers = smaller file, more quality loss. Q4_K_M and Q5_K_M are popular sweet spots for local deployment.

Advanced

Memory-bound vs. compute-bound inference

Understanding whether your inference is memory-bound or compute-bound is critical for choosing the right optimization:

Arithmetic Intensity = FLOPs / Bytes loaded from memory

If intensity < hardware's ops:byte ratio β†’ memory-bound (quantization helps a lot)
If intensity > hardware's ops:byte ratio β†’ compute-bound  (need more FLOPS)

Example: A100 has ~312 TFLOPS FP16 and ~2 TB/s bandwidth β†’ ops:byte ratio β‰ˆ 156
         Autoregressive decode of a 7B model: intensity β‰ˆ 1-2 β†’ heavily memory-bound
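A back-of-the-envelope check in Python, using the A100 figures above (during decode every weight is read once per token and used in roughly one multiply-accumulate, i.e., ~2 FLOPs):

```python
def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

# A100: ~312 TFLOPS FP16 compute, ~2 TB/s memory bandwidth
OPS_PER_BYTE = 312e12 / 2e12   # hardware ops:byte ratio, ~156

# Single-token decode of a 7B model in FP16: each weight is read once (2 bytes)
# and used in one multiply-accumulate (2 FLOPs)
params = 7e9
intensity = arithmetic_intensity(2 * params, 2 * params)   # = 1.0

print("memory-bound" if intensity < OPS_PER_BYTE else "compute-bound")
# Quantizing to INT4 cuts bytes moved 4x per decode step, directly attacking the bottleneck
```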

KV cache and its optimization

During autoregressive generation, the model must store key and value tensors for all past tokens in all attention layers. This is the KV cache, and it can dominate memory usage for long sequences or large batches.
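The cache size is easy to estimate. A sketch assuming FP16 and a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128; models with grouped-query attention have fewer KV heads and a much smaller cache):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """2 tensors (K and V) per layer, each of shape [batch, kv_heads, seq_len, head_dim]."""
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total / 1024 ** 3

# 4k context, batch of 8, FP16
cache = kv_cache_gb(32, 32, 128, 4096, 8)
print(f"{cache:.1f} GB")  # 16.0 GB -- more than the 4-bit weights of the same model
```

This is why KV-cache quantization (FP8/INT8 cache) and paged cache management matter as much as weight quantization at long contexts.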

Speculative decoding

Speculative decoding uses a small, fast draft model to generate candidate tokens, which the large target model then verifies in parallel. Since verification (a single forward pass over multiple tokens) is much cheaper than sequential generation, this can provide 2–3× speedup without any quality loss.
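Under greedy decoding the verification logic reduces to prefix matching, which makes the "no quality loss" property easy to see in a toy sketch. Here `draft_next` and `target_next` are hypothetical stand-ins for the two models' next-token functions; a real system verifies all k positions in one batched forward pass and uses an acceptance-rejection rule when sampling:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One toy speculative-decoding step under greedy decoding.
    The draft proposes k tokens; the target accepts the longest prefix that
    matches its own greedy choices, then contributes one token itself."""
    # Draft phase: k cheap sequential calls
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Verify phase: in a real system this is ONE batched target forward pass
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)  # replace the first mismatch with the target's token
            break
    else:
        accepted.append(target_next(ctx))  # all k accepted: free bonus token
    return accepted
```

Every emitted token is one the target model would have produced itself, which is the lossless guarantee; the speedup comes from accepting several tokens per target forward pass.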

Knowledge distillation

Distillation trains a smaller student model to mimic the behavior of a larger teacher model. Unlike quantization, distillation changes the model architecture itself.
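The classic recipe (Hinton et al.) trains the student on the teacher's temperature-softened output distribution. A minimal sketch of the soft-target term (the usual hard-label cross-entropy term is omitted here):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target cross-entropy between teacher and student distributions.
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    teacher_p = softmax(teacher_logits, temperature)
    student_p = softmax(student_logits, temperature)
    ce = -sum(t * math.log(s) for t, s in zip(teacher_p, student_p))
    return ce * temperature ** 2
```

Higher temperatures expose the teacher's "dark knowledge" (relative probabilities of wrong answers), which is what makes soft targets richer than hard labels.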

Serving infrastructure and optimization stacks

Production LLM serving combines many optimizations beyond just quantization: continuous (in-flight) batching, paged KV-cache management (vLLM's PagedAttention), fused kernels and Flash Attention, tensor parallelism across GPUs, speculative decoding, and prefix caching. Frameworks such as vLLM, TGI, and TensorRT-LLM bundle these together.

FP8 quantization in depth

FP8 is a hardware-native 8-bit floating-point format supported on NVIDIA Hopper (H100/H200) and Ada Lovelace GPUs. Unlike INT8, FP8 retains a floating-point representation, making it easier to quantize both weights and activations without calibration headaches.

FP8 Format Comparison:

              E4M3                    E5M2
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 Format: β”‚Sβ”‚EEEEβ”‚MMM β”‚         β”‚Sβ”‚EEEEEβ”‚MM β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 Range:   Β±448                   Β±57,344
 Precision: ~3 decimal digits    ~2 decimal digits
 Use:     Inference (fwd pass)   Training (gradients)

Practical impact on a 70B model:
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ Precision    β”‚ Weights  β”‚ Quality β”‚
 β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 β”‚ FP16/BF16    β”‚ 140 GB   β”‚ 100%    β”‚
 β”‚ FP8 (E4M3)  β”‚  70 GB   β”‚ ~99.5%  β”‚
 β”‚ INT8 (calib) β”‚  70 GB   β”‚ ~99%    β”‚
 β”‚ INT4 (GPTQ)  β”‚  35 GB   β”‚ ~97%    β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
				

Pruning: structured vs. unstructured

Pruning removes weights or entire structures from a model. Unlike quantization (which reduces precision), pruning reduces the number of parameters. It can be combined with quantization for compound compression.
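Unstructured pruning in its simplest form is magnitude pruning (a sketch; structured pruning would instead drop whole rows, attention heads, or layers so the speedup needs no sparse kernels):

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured)."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]  # k-th smallest magnitude
    return [0.0 if abs(w) <= cutoff else w for w in weights]

w = [0.1, -2.0, 0.05, 1.5]
print(magnitude_prune(w))  # [0.0, -2.0, 0.0, 1.5] -- 50% sparse, large weights kept
```

Note that unstructured sparsity only speeds inference on hardware/kernels that exploit it (e.g., 2:4 semi-structured sparsity on NVIDIA GPUs); otherwise it saves storage but not latency.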

Flash Attention

Flash Attention is an IO-aware attention algorithm that is critical for practical LLM optimization. It computes exact attention (not an approximation) while dramatically reducing GPU memory reads/writes.

Operator fusion and graph optimization

Fusing multiple sequential operations (e.g., linear → bias → GELU activation) into a single GPU kernel avoids intermediate memory writes and kernel launch overhead. This is an orthogonal optimization that compounds with quantization.
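The effect can be mimicked in plain Python: the unfused version materializes two intermediate lists (the analogue of extra round-trips to GPU memory), while the fused version streams through the data once. `gelu` here is the common tanh approximation; both paths compute identical results:

```python
import math

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def unfused(xs, weight, bias):
    t1 = [x * weight for x in xs]   # intermediate buffer 1 written out
    t2 = [t + bias for t in t1]     # intermediate buffer 2 written out
    return [gelu(t) for t in t2]    # three passes over the data

def fused(xs, weight, bias):
    return [gelu(x * weight + bias) for x in xs]   # one pass, no intermediates

xs = [-1.0, 0.0, 0.5, 2.0]
```

On a GPU, compilers like torch.compile and TensorRT perform this transformation automatically at the kernel level.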

Mixed-precision and emerging techniques

Production stacks increasingly mix precisions within a single deployment: for example, 4-bit weights with an FP16 KV cache, or FP8 matmuls with BF16 accumulation. The guiding principle is to give each tensor the cheapest format it can tolerate.

Benchmarking quantization trade-offs

benchmarks = [
    {"method": "FP16 baseline",   "bits": 16,  "memory_gb": 14.0, "latency_ms": 2100, "perplexity": 5.47},
    {"method": "GPTQ-4bit-g128",  "bits": 4,   "memory_gb": 4.2,  "latency_ms": 980,  "perplexity": 5.63},
    {"method": "AWQ-4bit",        "bits": 4,   "memory_gb": 4.0,  "latency_ms": 950,  "perplexity": 5.60},
    {"method": "bnb-NF4",         "bits": 4,   "memory_gb": 4.5,  "latency_ms": 1400, "perplexity": 5.70},
    {"method": "GGUF-Q4_K_M",     "bits": 4.5, "memory_gb": 4.8,  "latency_ms": 1200, "perplexity": 5.58},
    {"method": "INT8 (LLM.int8)", "bits": 8,   "memory_gb": 7.5,  "latency_ms": 1500, "perplexity": 5.48},
]

# Find best method meeting a quality threshold
threshold = 5.65  # max acceptable perplexity
viable = [b for b in benchmarks if b["perplexity"] <= threshold]
best = min(viable, key=lambda r: r["latency_ms"])
print(f"Best viable: {best['method']} β€” {best['latency_ms']}ms, {best['memory_gb']} GB, ppl={best['perplexity']}")

Measuring quality after quantization

Perplexity alone is not sufficient. A model can have acceptable perplexity yet still fail on specific task categories. A comprehensive evaluation should therefore pair perplexity with end-task benchmarks such as MMLU (knowledge), HumanEval (code), and GSM8K (math):

# Quality evaluation framework after quantization
from dataclasses import dataclass

@dataclass
class QuantEvalResult:
    method: str
    bits: int
    ppl_wikitext2: float
    mmlu_acc: float
    humaneval_pass1: float
    gsm8k_acc: float
    quality_verdict: str

def evaluate_quantization(baseline_ppl, quant_ppl, task_scores):
    """Determine if quantization quality is acceptable."""
    ppl_delta = quant_ppl - baseline_ppl
    degradations = []

    if ppl_delta > 1.0:
        degradations.append(f"perplexity increase too high: {ppl_delta:.2f}")
    for task, (base_score, quant_score) in task_scores.items():
        drop = (base_score - quant_score) / base_score * 100
        if drop > 5:
            degradations.append(f"{task}: {drop:.1f}% degradation")

    if not degradations:
        return "PASS β€” safe to deploy"
    return f"REVIEW β€” {'; '.join(degradations)}"

Production mindset: Always benchmark on your actual task. Perplexity is a useful proxy but not sufficient—measure end-task metrics (accuracy, F1, human preference) on representative data. A model with worse perplexity may still perform better on your specific downstream task.

When quantization hurts more

Quantization error is not spread evenly across tasks. Degradation is typically largest on multi-step reasoning (math, code generation), long-context retrieval, and smaller models, which have less redundancy to absorb the noise; summarization and classification tend to be far more forgiving. Re-test these sensitive categories explicitly after quantizing.

Decision framework for production

Start
  β”‚
  β”œβ”€ Need to fine-tune? ──Yes──► bitsandbytes NF4 + QLoRA
  β”‚                                    β”‚
  β”‚                              Merge adapters, then quantize for serving ──►
  β”‚                                                                          β”‚
  β”œβ”€ GPU serving? ─────Yes──┬─► Latency-critical? ─► AWQ or GPTQ (ExLlama)  β”‚
  β”‚                         β”‚                                                β”‚
  β”‚                         └─► Cost-critical?  ───► GPTQ-4bit, vLLM + batching
  β”‚                                                                          β”‚
  β”œβ”€ CPU / edge? ──────Yes──► GGUF (Q4_K_M or Q5_K_M via llama.cpp/Ollama)  β”‚
  β”‚                                                                          β”‚
  └─ Maximum quality? ─Yes──► FP16/BF16 or INT8 β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key concepts for exams

Critical terms

Common exam-style questions

To-do list

Learn

  • Understand fp32, fp16, bf16, int8, int4, and NF4 at a conceptual level.
  • Learn the difference between memory-bound and compute-bound inference.
  • Study post-training quantization (PTQ) vs. quantization-aware training (QAT).
  • Learn what KV cache is and why it matters for generation latency.
  • Understand the differences between GPTQ, AWQ, bitsandbytes, and GGUF.
  • Learn what calibration datasets are and when they are needed.
  • Study how speculative decoding works and its mathematical guarantees.
  • Understand group size, scale, zero-point, and symmetric vs. asymmetric quantization.

Practice

  • Estimate memory footprints for 3B, 7B, 13B, and 70B models at several precisions.
  • Load a model with bitsandbytes (4-bit NF4) and compare response quality to FP16.
  • Quantize a model with GPTQ using AutoGPTQ and test generation speed.
  • Download a GGUF model and run it with llama.cpp or Ollama.
  • Record latency, throughput, and quality across precision levels and methods.
  • Identify when quantization hurts a task more than expected (try math/code tasks).
  • Calculate KV cache memory for different models, batch sizes, and sequence lengths.

Build

  • Create a benchmark script comparing GPTQ, AWQ, and bitsandbytes on the same model.
  • Build a decision matrix recommending quantization methods for different deployment scenarios (cloud GPU, consumer GPU, CPU, edge).
  • Set up a vLLM or TGI serving instance with a quantized model and load-test it.
  • Implement QLoRA fine-tuning on a 4-bit base model and evaluate the merged result.
  • Package a local inference setup with reproducible benchmark inputs and charts.
  • Build a comprehensive quantization evaluation pipeline that tests perplexity, MMLU, code generation, and long-context retrieval before/after quantization.