Beginner
What is quantization?
Quantization means representing model weights (and sometimes activations) with fewer bits. Instead of storing every number as a 32-bit or 16-bit floating-point value, you compress them down to 8-bit integers, 4-bit integers, or even lower. The model becomes smaller, uses less memory, and often runs faster—at the cost of a small accuracy trade-off.
Simple intuition: imagine a photograph saved at ultra-high resolution versus a compressed JPEG. The JPEG is much smaller and looks almost the same. Quantization does something similar for neural network numbers.
Numeric data types at a glance
Understanding the common data types is fundamental:
- FP32 (float32) – 32-bit floating point. Full precision, used historically for training. 4 bytes per parameter.
- FP16 (float16) – 16-bit floating point. Half as much memory as FP32. Common for GPU inference. 2 bytes per parameter.
- BF16 (bfloat16) – 16-bit “brain float.” Same memory as FP16 but with a wider exponent range, making training more stable. Developed by Google Brain and now widely adopted.
- INT8 – 8-bit integers. 1 byte per parameter. Good balance between speed and accuracy.
- INT4 / NF4 – 4-bit integers or normalized floats. 0.5 bytes per parameter. Aggressive compression; used in tools like GPTQ, AWQ, and bitsandbytes QLoRA.
- INT2 / INT3 – Extreme compression. Significant quality loss; used in specialized research methods like AQLM and QuIP#.
Bit Layout Comparison:
FP32:      [S][EEEEEEEE][MMMMMMMMMMMMMMMMMMMMMMM]  32 bits: 1 sign + 8 exponent + 23 mantissa
BF16:      [S][EEEEEEEE][MMMMMMM]                  16 bits: 1 + 8 + 7; same exponent range as FP32 (no overflow)
FP16:      [S][EEEEE][MMMMMMMMMM]                  16 bits: 1 + 5 + 10; more precision than BF16, but limited range
FP8 E4M3:  [S][EEEE][MMM]                           8 bits: 1 + 4 + 3; good for inference (more precision)
FP8 E5M2:  [S][EEEEE][MM]                           8 bits: 1 + 5 + 2; good for gradients (wider range)
INT8:      [BBBBBBBB]                               8 bits = 256 discrete values
INT4:      [BBBB]                                   4 bits = 16 discrete values
NF4:       [BBBB] (mapped to normal distribution)   4 bits = 16 optimal quantile values
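To make the FP32/BF16 relationship concrete, here is a small sketch showing that BF16 is essentially FP32 with the low 16 mantissa bits dropped (plain truncation is used for clarity; real hardware converts with round-to-nearest-even):

```python
import struct

def fp32_to_bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 by keeping only the top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bf16_bits = bits & 0xFFFF0000          # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bf16_bits))[0]

print(fp32_to_bf16(3.14159265))  # 3.140625 -- only ~3 significant digits survive
print(fp32_to_bf16(1e38))        # huge values stay representable (FP16 would overflow)
```

Note how the exponent is untouched, which is exactly why BF16 keeps FP32's dynamic range while giving up mantissa precision.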
Why does it matter?
- Memory savings: A 7B-parameter model in FP16 needs ~14 GB. In INT4 it needs ~3.5 GB—a 4× reduction.
- Speed: Smaller data moves faster through memory buses. LLM inference is often memory-bandwidth-bound, so fewer bytes per weight directly reduces latency.
- Cost: Smaller models can run on cheaper hardware (e.g., a consumer GPU instead of an A100).
- Accessibility: Quantization enables running large models on laptops, phones, and edge devices.
Memory footprint formula
def estimate_memory_gb(num_params, bits_per_param):
    """Estimate GPU memory for model weights only (excludes KV cache, activations, overhead)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / (1024 ** 3)

# Example: 7B model at various precisions
for bits in [32, 16, 8, 4]:
    gb = estimate_memory_gb(7_000_000_000, bits)
    print(f"{bits:>2}-bit: {gb:.1f} GB")
Memory per model (weights only, approximate)
Model Size │ FP32   │ FP16/BF16 │ INT8  │ INT4
───────────┼────────┼───────────┼───────┼────────
3B         │ 12 GB  │ 6 GB      │ 3 GB  │ 1.5 GB
7B         │ 28 GB  │ 14 GB     │ 7 GB  │ 3.5 GB
13B        │ 52 GB  │ 26 GB     │ 13 GB │ 6.5 GB
70B        │ 280 GB │ 140 GB    │ 70 GB │ 35 GB
Remember: Actual GPU memory usage is higher than weight-only estimates. You must also account for KV cache (scales with sequence length and batch size), activation memory, CUDA context (~500 MB–1 GB), and framework overhead.
Intermediate
Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT)
There are two fundamental approaches to quantizing a model:
- Post-Training Quantization (PTQ): Take an already-trained model and convert its weights to lower precision. No retraining needed. Fast and convenient, but can lose accuracy at very low bit-widths. Most popular methods (GPTQ, AWQ, bitsandbytes) fall in this category.
- Quantization-Aware Training (QAT): Simulate quantization effects during training or fine-tuning so the model learns to compensate. Produces higher-quality results at low bit-widths but requires a full training pipeline and compute budget.
Exam tip: PTQ is faster and cheaper; QAT gives better quality at extreme compression. For most LLM deployment, PTQ at 4-bit or 8-bit is the standard practice.
The outlier problem
One of the most important concepts in LLM quantization: large language models develop a small number of activation channels with values 10–100× larger than the rest. These outlier features (also called “emergent features”) appear in models above ~6B parameters and concentrate in specific hidden dimensions.
- Fewer than 1% of activation channels contain outliers, but they are essential for model quality.
- Naive INT8 quantization maps both outliers and normal values onto the same 256-value grid. The outliers dominate the range, crushing normal values into just a few grid points.
- LLM.int8() (bitsandbytes): isolates outlier channels and processes them in FP16 while the rest runs in INT8. Nearly lossless.
- SmoothQuant: multiplicatively transfers quantization difficulty from activations (hard to quantize) to weights (easy to quantize) using a per-channel smoothing factor.
- AWQ: identifies salient weight channels via activation magnitudes and scales them before quantization to protect accuracy.
The Outlier Problem Visualized:

Activation values before quantization:

    outlier: 85.0        (appears in ~1% of channels)
    normal:  -2.0 to 3.0 (the remaining ~99% of channels)

With naive INT8 (256 values for the range -85 to +85):
- Each step ≈ 0.67, so values between -2 and 3 get only ~7 distinct levels
- Massive precision loss for 99% of channels!

With LLM.int8() or SmoothQuant:
- Outliers handled separately (FP16 or smoothed)
- Normal channels get full INT8 precision in their range
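The precision loss from a single outlier can be reproduced in a few lines. This is a simplified sketch of naive symmetric INT8 quantization (real kernels use per-channel or per-group scales, but the effect is the same):

```python
def quantize_int8(values, max_abs):
    """Naive symmetric INT8: map [-max_abs, +max_abs] onto 127 signed steps."""
    scale = max_abs / 127
    return [round(v / scale) for v in values]

normal = [x / 10 for x in range(-20, 31)]   # normal channel values: -2.0 .. 3.0

# Quantization range dominated by a single outlier at 85.0
codes_with_outlier = quantize_int8(normal, max_abs=85.0)
print(len(set(codes_with_outlier)))   # 8 distinct levels for 99% of channels

# Same values quantized in their own range
codes_own_range = quantize_int8(normal, max_abs=3.0)
print(len(set(codes_own_range)))      # all 51 values get distinct codes
```

With the outlier in the range, 51 different activation values collapse onto 8 integer codes; handled separately, every value survives.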
Weight-only vs. weight-and-activation quantization
- Weight-only quantization: Only the stored model weights are quantized. Computations still happen in FP16 at runtime (weights are dequantized on-the-fly). Simpler, widely used (GPTQ, AWQ). Addresses the memory bottleneck.
- Weight-and-activation quantization: Both weights and intermediate activations are kept in low precision. Harder to do well because activation distributions are dynamic and can have outliers. Methods like SmoothQuant handle this by mathematically migrating the quantization difficulty from activations to weights.
Popular quantization methods compared
Method           │ Type        │ Bits │ Calibration   │ Speed     │ Best For
─────────────────┼─────────────┼──────┼───────────────┼───────────┼────────────────────────────────────
GPTQ             │ Weight-only │ 2-8  │ Required      │ Fast gen  │ GPU deployment, text gen
AWQ              │ Weight-only │ 4    │ Required      │ Fast gen  │ GPU deployment, preserves quality
bitsandbytes     │ Weight-only │ 4/8  │ None (0-shot) │ Moderate  │ Easy setup, QLoRA fine-tuning
GGUF (llama.cpp) │ Weight-only │ 2-8  │ None          │ CPU+GPU   │ Local/edge, CPU inference
HQQ              │ Weight-only │ 1-8  │ None (0-shot) │ Fast      │ Quick quantization, no calibration
AQLM             │ Weight-only │ 1-2  │ Required      │ Moderate  │ Extreme 1-2 bit compression
SmoothQuant      │ W+A         │ 8    │ Required      │ Very fast │ INT8 GPU with tensor cores
GPTQ in depth
GPTQ (Generative Pre-trained Transformer Quantization) uses a one-shot weight quantization algorithm based on approximate second-order information (the Hessian). It processes weights column-by-column and compensates for quantization error by adjusting remaining weights.
- Requires a small calibration dataset (~128 samples) to measure weight importance.
- Supports group-wise quantization (e.g., groups of 128 weights share a scale factor), trading a small increase in model size for better accuracy.
- Serializable: quantized models can be saved and shared (Hub models typically carry a -GPTQ suffix).
- Uses optimized GPU kernels (ExLlama, ExLlamaV2, Marlin) for fast inference.
# Quantize a model with GPTQ using AutoGPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration dataset (small sample of representative text)
calibration_texts = ["Example calibration text...", ...]
gptq_config = GPTQConfig(
    bits=4,            # 4-bit quantization
    group_size=128,    # weights grouped in blocks of 128
    dataset=calibration_texts,
    tokenizer=tokenizer,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
AWQ (Activation-Aware Weight Quantization)
AWQ observes that not all weights are equally important. By analyzing activation distributions, it identifies the ~1% of “salient” weight channels that matter most and applies a mathematical scaling transformation to protect them during quantization.
- Does not rely on backpropagation or weight reconstruction—making it faster than GPTQ to quantize.
- Generalizes better across domains (code, math, multilingual) because it avoids overfitting to the calibration set.
- Works well for instruction-tuned and multimodal models.
- MLSys 2024 Best Paper Award.
bitsandbytes (zero-shot quantization)
bitsandbytes is the easiest quantization path—it requires no calibration data. Any model with torch.nn.Linear layers can be quantized instantly on load.
- 8-bit (LLM.int8()): Uses mixed-precision decomposition. Outlier features stay in FP16 while the rest uses INT8. Nearly lossless.
- 4-bit (NF4 / QLoRA): Uses the NormalFloat4 data type, which is information-theoretically optimal for normally distributed weights. Combined with double quantization (quantizing the quantization constants) for further savings.
- Best suited for fine-tuning with LoRA adapters (QLoRA workflow).
- Adapters trained on bitsandbytes bases can be merged back with zero inference performance loss.
# Load a model in 4-bit with bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4
    bnb_4bit_compute_dtype="bfloat16",   # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,      # double quantization saves ~0.4 bits/param
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
GGUF and llama.cpp
GGUF is a file format designed for running quantized models on CPUs (and optionally offloading layers to GPU). It powers the llama.cpp ecosystem and tools like Ollama, LM Studio, and GPT4All.
- Supports a wide range of quantization types: Q2_K, Q3_K_S/M/L, Q4_0, Q4_K_S/M, Q5_0, Q5_K_S/M, Q6_K, Q8_0, and more.
- The “K” variants use k-quant (importance-based mixed quantization) for better quality at the same average bit-width.
- Can split computation between CPU and GPU layers for flexible hardware usage.
- No Python/PyTorch dependency at runtime—pure C/C++ inference.
Naming convention: Q4_K_M means 4-bit k-quant medium. Lower numbers = smaller file, more quality loss. Q4_K_M and Q5_K_M are popular sweet spots for local deployment.
Advanced
Memory-bound vs. compute-bound inference
Understanding whether your inference is memory-bound or compute-bound is critical for choosing the right optimization:
- Memory-bound (autoregressive decoding): Generating tokens one-by-one at small batch sizes. The bottleneck is loading weights from GPU memory (HBM) into compute cores. Quantization directly helps because it reduces the bytes that must be loaded per token.
- Compute-bound (prefill / large batches): Processing a long prompt or large batch of requests. The bottleneck is arithmetic throughput (FLOPS). Quantization helps less; you may need tensor parallelism or faster hardware instead.
Arithmetic Intensity = FLOPs / Bytes loaded from memory
If intensity < hardware's ops:byte ratio → memory-bound (quantization helps a lot)
If intensity > hardware's ops:byte ratio → compute-bound (need more FLOPS)
Example: A100 has ~312 TFLOPS FP16 and ~2 TB/s bandwidth → ops:byte ratio ≈ 156
Autoregressive decode of a 7B model: intensity ≈ 1-2 → heavily memory-bound
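The ops:byte comparison above can be sketched as a tiny helper (illustrative only; the hardware figures are the approximate A100 numbers quoted above):

```python
def regime(flops_per_byte, peak_tflops, bandwidth_tbs):
    """Classify an operation as memory- or compute-bound on given hardware."""
    ops_byte_ratio = (peak_tflops * 1e12) / (bandwidth_tbs * 1e12)
    label = "memory-bound" if flops_per_byte < ops_byte_ratio else "compute-bound"
    return label, ops_byte_ratio

# A100: ~312 TFLOPS FP16, ~2 TB/s HBM bandwidth
print(regime(2, 312, 2))     # autoregressive decode: ('memory-bound', 156.0)
print(regime(200, 312, 2))   # large-batch prefill:   ('compute-bound', 156.0)
```

Anything below the hardware's ratio of ~156 FLOPs per byte leaves the compute units idle waiting on memory, which is exactly where shrinking bytes-per-weight pays off.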
KV cache and its optimization
During autoregressive generation, the model must store key and value tensors for all past tokens in all attention layers. This is the KV cache, and it can dominate memory usage for long sequences or large batches.
- KV cache size = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element
- For Llama-2-7B at FP16 with 4K context: ~2 GB per batch element (2 × 32 layers × 4096 hidden dim × 4096 tokens × 2 bytes).
- KV cache quantization: Store cached keys/values in INT8 or FP8 instead of FP16. Reduces KV cache memory by 50% with minimal quality loss.
- Grouped-Query Attention (GQA): Models like Llama 2 70B and Llama 3 use fewer KV heads than query heads, inherently shrinking the KV cache.
- Multi-Query Attention (MQA): Uses a single KV head shared across all query heads. Maximum KV cache savings but may reduce quality.
- Paged Attention (vLLM): Manages KV cache like virtual memory pages, eliminating fragmentation and enabling much higher batch sizes.
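The KV-cache formula above, written as a quick calculator (the layer/head shapes below are the published Llama-2-7B values; check a specific checkpoint's config before relying on them):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """KV cache bytes: 2 (K and V) x layers x kv_heads x head_dim x seq x batch x bytes."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total / (1024 ** 3)

# Llama-2-7B: 32 layers, 32 KV heads (no GQA), head_dim 128, FP16 (2 bytes)
print(kv_cache_gib(32, 32, 128, 4096, 1, 2))   # 2.0 GiB at 4K context
# Same shapes with an INT8 KV cache: half the memory
print(kv_cache_gib(32, 32, 128, 4096, 1, 1))   # 1.0 GiB
```

Swapping in a GQA configuration (e.g. 8 KV heads instead of 32) cuts the result by 4×, which is why GQA and KV-cache quantization compound so well.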
Speculative decoding
Speculative decoding uses a small, fast draft model to generate candidate tokens, which the large target model then verifies in parallel. Since verification (a single forward pass over multiple tokens) is much cheaper than sequential generation, this can provide 2–3× speedup without any quality loss.
- The draft model must roughly match the target model’s distribution to achieve high acceptance rates.
- Mathematically guaranteed to produce the exact same output distribution as the target model alone.
- Increasingly built into serving frameworks (vLLM, TGI, Medusa).
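The correctness guarantee can be checked analytically on a toy vocabulary. This sketch computes the combined output distribution of speculative sampling (accept with probability min(1, p_target/p_draft), else resample from the normalized residual) and confirms it matches the target; it is a worked example, not a serving implementation:

```python
# Toy 3-token vocabulary
p_target = [0.6, 0.3, 0.1]   # target model distribution
p_draft  = [0.3, 0.5, 0.2]   # draft model distribution

# Accept draft token t with probability min(1, p_target[t] / p_draft[t])
accept = [min(1.0, t / d) for t, d in zip(p_target, p_draft)]
p_reject = sum(d * (1 - a) for d, a in zip(p_draft, accept))

# On rejection, resample from the normalized residual max(0, p_target - p_draft)
residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
z_res = sum(residual)
residual = [r / z_res for r in residual]

# Combined output distribution equals the target distribution
combined = [d * a + p_reject * r for d, a, r in zip(p_draft, accept, residual)]
print(combined)   # ≈ [0.6, 0.3, 0.1], i.e. exactly p_target up to float rounding
```

No matter how bad the draft distribution is, the identity holds; a bad draft only lowers the acceptance rate (and thus the speedup), never the output quality.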
Knowledge distillation
Distillation trains a smaller student model to mimic the behavior of a larger teacher model. Unlike quantization, distillation changes the model architecture itself.
- The student learns from the teacher’s soft probability distributions (logits), not just hard labels.
- Can be combined with quantization: distill first, then quantize the student.
- Examples: DistilBERT (40% smaller BERT, 97% performance), Gemma from Gemini, Llama 3.2 1B/3B distilled from 70B.
Serving infrastructure and optimization stacks
Production LLM serving combines many optimizations beyond just quantization:
- vLLM: PagedAttention for KV cache, continuous batching, tensor parallelism. Supports GPTQ, AWQ, bitsandbytes, and FP8.
- TGI (Text Generation Inference): Hugging Face’s serving solution. Flash Attention, continuous batching, quantization support.
- TensorRT-LLM: NVIDIA’s optimized runtime. Fused kernels, FP8/INT4, in-flight batching, multi-GPU via tensor and pipeline parallelism.
- llama.cpp / Ollama: CPU-first with optional GPU offloading. GGUF format.
- ExLlamaV2: Highly optimized GPTQ inference, especially fast for consumer GPUs.
FP8 quantization in depth
FP8 is a hardware-native 8-bit floating-point format supported on NVIDIA Hopper (H100/H200) and Ada Lovelace GPUs. Unlike INT8, FP8 retains a floating-point representation, making it easier to quantize both weights and activations without calibration headaches.
- E4M3 (4 exponent bits, 3 mantissa bits): higher precision, used for forward-pass weights and activations. Dynamic range: ±448. Can represent values without needing the complex outlier handling that INT8 requires.
- E5M2 (5 exponent bits, 2 mantissa bits): wider range at the cost of precision. Primarily used for gradients during FP8 training.
- FP8 achieves near-FP16 quality with 2× memory reduction and significant throughput gains — simpler than INT8 calibration with comparable or better results.
- Increasingly the default inference precision on H100 deployments. Supported by vLLM, TensorRT-LLM, FBGEMM, and torchao.
- FP8 KV-cache quantization is especially valuable: reduces KV-cache memory by 2× with minimal quality impact, enabling larger batch sizes and longer contexts.
FP8 Format Comparison:

            E4M3                  E5M2
Format:     [S][EEEE][MMM]        [S][EEEEE][MM]
Range:      ±448                  ±57,344
Precision:  ~3 decimal digits     ~2 decimal digits
Use:        Inference (fwd pass)  Training (gradients)

Practical impact on a 70B model:

Precision    │ Weights │ Quality
─────────────┼─────────┼────────
FP16/BF16    │ 140 GB  │ 100%
FP8 (E4M3)   │ 70 GB   │ ~99.5%
INT8 (calib) │ 70 GB   │ ~99%
INT4 (GPTQ)  │ 35 GB   │ ~97%
Pruning: structured vs. unstructured
Pruning removes weights or entire structures from a model. Unlike quantization (which reduces precision), pruning reduces the number of parameters. It can be combined with quantization for compound compression.
- Unstructured pruning: Sets individual weights to zero based on magnitude (or other criteria). Can achieve high sparsity (e.g., 50–90%) but requires sparse matrix hardware for actual speedup. Limited GPU support currently.
- Structured pruning: Removes entire rows, columns, attention heads, or even whole layers. Produces dense, smaller models that run on standard hardware without special sparse kernels.
- Wanda (Weights and Activations): A simple, no-retraining pruning method for LLMs. Prunes weights based on the product of weight magnitude and input activation norm. Matches SparseGPT quality with much less compute.
- SparseGPT: Uses approximate second-order information (similar to GPTQ) to optimally prune weights while compensating for the removed ones. Can achieve 50–60% unstructured sparsity with minimal quality loss.
- Layer pruning: Some layers contribute less than others. Removing entire transformer blocks (e.g., middle layers) can reduce model size by 10–30% with manageable quality loss. Used in models like Llama-3.2 1B/3B.
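To make the Wanda scoring rule concrete, here is a toy sketch with hypothetical numbers (real Wanda operates on full layers with calibration activations): each weight is scored by |weight| × input-activation norm, and the lowest-scoring half of each row is pruned:

```python
# Simplified Wanda-style scoring on a 2x4 weight matrix
W = [[0.9, -0.1, 0.4, -0.8],
     [0.2,  0.7, -0.3, 0.5]]
act_norm = [0.1, 5.0, 1.0, 0.2]   # per-input-channel activation norms ||X_j||

# Score = |W_ij| * ||X_j||: a small weight on a loud channel can outrank
# a large weight on a quiet channel.
scores = [[abs(w) * n for w, n in zip(row, act_norm)] for row in W]

# Prune 50% per row: zero out the lowest-scoring half
pruned = []
for row, srow in zip(W, scores):
    cutoff = sorted(srow)[len(srow) // 2]
    pruned.append([w if s >= cutoff else 0.0 for w, s in zip(row, srow)])

print(pruned)   # [[0.0, -0.1, 0.4, 0.0], [0.0, 0.7, -0.3, 0.0]]
```

Note that the largest weight (0.9) is pruned while the tiny -0.1 survives, because the latter sits on a high-activation channel; that activation-awareness is what separates Wanda from plain magnitude pruning.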
Flash Attention
Flash Attention is an IO-aware attention algorithm that is critical for practical LLM optimization. It computes exact attention (not an approximation) while dramatically reducing GPU memory reads/writes.
- Reduces attention memory from O(n²) to O(n) by computing attention in tiles and never materializing the full n×n attention matrix.
- Provides 2–4× wall-clock speedup over standard attention.
- Mathematically identical output — not an approximation. Just a smarter order of operations.
- Flash Attention 2 improves parallelism across sequence length; Flash Attention 3 optimizes for H100 asynchronous execution.
- Now the default in virtually all LLM frameworks. Complementary to quantization (not a replacement — use both).
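The "smarter order of operations" is the online softmax. This pure-Python sketch for a single query row shows that processing keys tile by tile with a running max and normalizer reproduces full softmax attention exactly (a didactic sketch, not the fused GPU kernel):

```python
import math

def attention_row(q, K, V):
    """Reference: softmax over all key scores at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, V)) / z for d in range(len(V[0]))]

def attention_row_tiled(q, K, V, tile=2):
    """Flash-style: running max/sum over tiles, never holding all scores."""
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for i in range(0, len(K), tile):
        for k, v in zip(K[i:i + tile], V[i:i + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k))
            m_new = max(m, s)
            corr = math.exp(m - m_new)      # rescale earlier partial sums
            z = z * corr + math.exp(s - m_new)
            acc = [a * corr + math.exp(s - m_new) * vj for a, vj in zip(acc, v)]
            m = m_new
    return [a / z for a in acc]

q = [1.0, 0.5]
K = [[0.2, 1.0], [1.5, -0.3], [0.7, 0.7], [-1.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
print(attention_row(q, K, V))
print(attention_row_tiled(q, K, V))   # identical up to float rounding
```

Only the running scalars (max, normalizer) and the output accumulator persist between tiles, which is why the full n×n score matrix never needs to exist.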
Operator fusion and graph optimization
Fusing multiple sequential operations (e.g., linear → bias → GELU activation) into a single GPU kernel avoids intermediate memory writes and kernel launch overhead. This is an orthogonal optimization that compounds with quantization.
- torch.compile: PyTorch’s JIT compiler automatically fuses operations and generates optimized Triton kernels. Can provide 1.5–2× speedup with one line of code.
- TensorRT: NVIDIA’s optimizer that performs layer fusion, precision calibration, kernel auto-tuning, and memory optimization.
- ONNX Runtime: Graph optimizations including constant folding, operator fusion, and hardware-specific optimizations across different backends.
Mixed-precision and emerging techniques
- Mixed quantization (k-quants): Different layers get different bit-widths based on their sensitivity. Attention layers often tolerate less compression than FFN layers.
- SpQR: Stores outlier weights in higher precision while keeping the rest at 3-4 bits. Near-lossless at 3-4 bit average.
- AQLM (Additive Quantization): Uses multi-codebook quantization for extreme compression (1–2 bits) while maintaining reasonable quality.
- QuIP# (Quantized with Incoherence Processing): Uses random orthogonal transformations to make weights more quantization-friendly. State-of-the-art at 2 bits.
Benchmarking quantization trade-offs
benchmarks = [
    {"method": "FP16 baseline",   "bits": 16,  "memory_gb": 14.0, "latency_ms": 2100, "perplexity": 5.47},
    {"method": "GPTQ-4bit-g128",  "bits": 4,   "memory_gb": 4.2,  "latency_ms": 980,  "perplexity": 5.63},
    {"method": "AWQ-4bit",        "bits": 4,   "memory_gb": 4.0,  "latency_ms": 950,  "perplexity": 5.60},
    {"method": "bnb-NF4",         "bits": 4,   "memory_gb": 4.5,  "latency_ms": 1400, "perplexity": 5.70},
    {"method": "GGUF-Q4_K_M",     "bits": 4.5, "memory_gb": 4.8,  "latency_ms": 1200, "perplexity": 5.58},
    {"method": "INT8 (LLM.int8)", "bits": 8,   "memory_gb": 7.5,  "latency_ms": 1500, "perplexity": 5.48},
]

# Find the fastest method meeting a quality threshold
threshold = 5.65  # max acceptable perplexity
viable = [b for b in benchmarks if b["perplexity"] <= threshold]
best = min(viable, key=lambda r: r["latency_ms"])
print(f"Best viable: {best['method']} -> {best['latency_ms']}ms, {best['memory_gb']} GB, ppl={best['perplexity']}")
Measuring quality after quantization
Perplexity alone is not sufficient. A model can have acceptable perplexity but fail on specific task categories. A comprehensive evaluation should include:
- Perplexity on held-out data: WikiText-2 is standard. A ΔPPL < 0.5 vs. FP16 is generally acceptable for 4-bit. Increases > 1.0 indicate problematic quality loss.
- Task-specific benchmarks: MMLU (knowledge), HumanEval (code), GSM8K (math), ARC (reasoning). Different tasks degrade at different rates.
- Long-context evaluation: Quantization errors compound over sequence length. Test at your maximum expected context length.
- Needle-in-a-haystack: Insert specific facts in long contexts and verify the quantized model retrieves them correctly. Tests information retention under compression.
- Multi-turn coherence: Run multi-turn conversation benchmarks. Small per-token errors can accumulate across turns, causing inconsistencies.
- Domain-specific prompts: If your application handles medical, legal, or financial content, test on domain-specific samples — these often rely on precise long-tail knowledge that quantization is most likely to displace.
# Quality evaluation framework after quantization
from dataclasses import dataclass

@dataclass
class QuantEvalResult:
    method: str
    bits: int
    ppl_wikitext2: float
    mmlu_acc: float
    humaneval_pass1: float
    gsm8k_acc: float
    quality_verdict: str

def evaluate_quantization(baseline_ppl, quant_ppl, task_scores):
    """Determine if quantization quality is acceptable."""
    ppl_delta = quant_ppl - baseline_ppl
    degradations = []
    if ppl_delta > 1.0:
        degradations.append(f"perplexity increase too high: {ppl_delta:.2f}")
    for task, (base_score, quant_score) in task_scores.items():
        drop = (base_score - quant_score) / base_score * 100
        if drop > 5:
            degradations.append(f"{task}: {drop:.1f}% degradation")
    if not degradations:
        return "PASS: safe to deploy"
    return f"REVIEW: {'; '.join(degradations)}"
Production mindset: Always benchmark on your actual task. Perplexity is a useful proxy but not sufficient—measure end-task metrics (accuracy, F1, human preference) on representative data. A model with worse perplexity may still perform better on your specific downstream task.
When quantization hurts more
- Smaller models are more sensitive to quantization than larger ones. A 3B model at 4-bit loses proportionally more quality than a 70B model at 4-bit.
- Math, code, and structured reasoning tasks are more sensitive to quantization errors than general text generation or summarization.
- Very long contexts amplify quantization errors because small per-token errors accumulate.
- Outlier features in activations (common in larger models) can cause INT8 quantization to fail catastrophically without outlier handling (which is why LLM.int8() and SmoothQuant exist).
Decision framework for production
Start
  │
  ├─ Need to fine-tune? ───Yes──► bitsandbytes NF4 + QLoRA
  │                               (merge adapters, then quantize for serving)
  │
  ├─ GPU serving? ─────────Yes─┬► Latency-critical? ──► AWQ or GPTQ (ExLlama)
  │                            └► Cost-critical? ─────► GPTQ-4bit, vLLM + batching
  │
  ├─ CPU / edge? ──────────Yes──► GGUF (Q4_K_M or Q5_K_M via llama.cpp/Ollama)
  │
  └─ Maximum quality? ─────Yes──► FP16/BF16 or INT8
Key concepts for exams
Critical terms
- Calibration dataset: A small set of representative inputs used by methods like GPTQ and AWQ to determine optimal quantization parameters. Typically ~128 samples.
- Group size: The number of weights sharing a single scale/zero-point. Smaller groups (e.g., 32 or 128) give better accuracy but slightly increase model size.
- Scale and zero-point: Parameters that map floating-point ranges to integer ranges during quantization: q = round(x / scale + zero_point).
- Symmetric vs. asymmetric quantization: Symmetric centers the quantization range at zero; asymmetric allows an offset (zero-point), fitting skewed distributions better.
- Per-tensor vs. per-channel quantization: Per-channel uses different scale factors for each output channel, giving better accuracy at the cost of more metadata.
- Dynamic vs. static quantization: Dynamic quantization computes activation ranges at runtime; static quantization precomputes them from calibration data.
- Double quantization: Quantizing the quantization constants themselves (used by bitsandbytes NF4). Saves ~0.4 bits per parameter overhead.
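The scale/zero-point machinery from the terms above, sketched for symmetric vs. asymmetric 8-bit quantization on a skewed, all-positive distribution (a toy round-trip, not a production quantizer):

```python
def sym_quantize(values, bits=8):
    """Symmetric: range centered at zero, no zero-point."""
    scale = max(abs(v) for v in values) / (2 ** (bits - 1) - 1)
    return [round(v / scale) * scale for v in values]

def asym_quantize(values, bits=8):
    """Asymmetric: scale + zero-point fit the exact [min, max] range."""
    lo, hi = min(values), max(values)
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale + zero_point))) for v in values]
    return [(qi - zero_point) * scale for qi in q]

# Skewed distribution: all values positive, so symmetric wastes half its codes
skewed = [0.1 + 0.05 * i for i in range(100)]
err = lambda deq: max(abs(a - b) for a, b in zip(deq, skewed))
print(err(sym_quantize(skewed)))    # negative half of the grid is never used
print(err(asym_quantize(skewed)))   # roughly half the round-trip error
```

On this range the asymmetric grid spends all 256 codes inside [0.1, 5.05], roughly halving the maximum round-trip error versus the symmetric grid spanning [-5.05, 5.05].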
Common exam-style questions
- Q: Why is BF16 preferred over FP16 for training? — BF16 has the same exponent range as FP32 (8-bit exponent), so it rarely overflows/underflows, making training more stable. FP16 has only a 5-bit exponent and requires loss scaling.
- Q: What is the main advantage of AWQ over GPTQ? — AWQ protects salient weights via activation-aware scaling rather than weight reconstruction, making it better at preserving quality across diverse tasks and modalities without overfitting the calibration set.
- Q: Why does bitsandbytes use NF4 instead of regular INT4? — Neural network weights follow an approximately normal distribution. NF4 places its 16 quantization levels to optimally represent a normal distribution, minimizing information loss.
- Q: How does speculative decoding guarantee identical output quality? — It uses a rejection sampling scheme: draft tokens are accepted by the target model with probability equal to min(1, p_target / p_draft). Rejected tokens are resampled from an adjusted distribution. This provably produces the exact target distribution.
- Q: Why are larger models less sensitive to quantization? — Larger models have more redundant parameters. Individual weight perturbations from quantization are averaged out across more parameters, resulting in smaller relative errors.
- Q: What is the difference between FP8 E4M3 and E5M2? — E4M3 has 4 exponent bits and 3 mantissa bits (range ±448, higher precision), used for inference activations and weights. E5M2 has 5 exponent bits and 2 mantissa bits (range ±57,344, wider range but less precision), used for gradients during training.
- Q: What is the difference between pruning and quantization? — Quantization reduces the precision of weights (fewer bits per weight). Pruning reduces the number of weights (setting them to zero or removing structures). Both reduce model size but through different mechanisms, and they can be combined.
- Q: Why does Flash Attention reduce memory from O(n²) to O(n)? — Standard attention materializes the full n×n attention matrix. Flash Attention computes attention in tiles, loading small blocks from HBM to SRAM, computing partial softmax/attention, and writing back only the final output — never storing the full intermediate matrix.
- Q: What is the “outlier problem” in quantization, and which methods solve it? — Large LLMs develop activation channels with values 10–100× larger than average. Naive quantization distorts these, causing big quality drops. LLM.int8() isolates outliers in FP16; SmoothQuant transfers difficulty to weights; AWQ scales salient channels to protect them.
- Q: When should you use QAT over PTQ? — Use QAT when PTQ at your target bit-width produces unacceptable quality loss (typically below 4 bits), when you have training compute budget available, or for safety-critical applications where every fraction of a percent matters.
To-do list
Learn
- Understand fp32, fp16, bf16, int8, int4, and NF4 at a conceptual level.
- Learn the difference between memory-bound and compute-bound inference.
- Study post-training quantization (PTQ) vs. quantization-aware training (QAT).
- Learn what KV cache is and why it matters for generation latency.
- Understand the differences between GPTQ, AWQ, bitsandbytes, and GGUF.
- Learn what calibration datasets are and when they are needed.
- Study how speculative decoding works and its mathematical guarantees.
- Understand group size, scale, zero-point, and symmetric vs. asymmetric quantization.
Practice
- Estimate memory footprints for 3B, 7B, 13B, and 70B models at several precisions.
- Load a model with bitsandbytes (4-bit NF4) and compare response quality to FP16.
- Quantize a model with GPTQ using AutoGPTQ and test generation speed.
- Download a GGUF model and run it with llama.cpp or Ollama.
- Record latency, throughput, and quality across precision levels and methods.
- Identify when quantization hurts a task more than expected (try math/code tasks).
- Calculate KV cache memory for different models, batch sizes, and sequence lengths.
Build
- Create a benchmark script comparing GPTQ, AWQ, and bitsandbytes on the same model.
- Build a decision matrix recommending quantization methods for different deployment scenarios (cloud GPU, consumer GPU, CPU, edge).
- Set up a vLLM or TGI serving instance with a quantized model and load-test it.
- Implement QLoRA fine-tuning on a 4-bit base model and evaluate the merged result.
- Package a local inference setup with reproducible benchmark inputs and charts.
- Build a comprehensive quantization evaluation pipeline that tests perplexity, MMLU, code generation, and long-context retrieval before/after quantization.