Beginner
Deployment means making the model available to real users or internal systems. In production, you care about whether requests finish on time, whether answers are safe enough, and whether the monthly bill stays acceptable. A model that works in a Jupyter notebook is not a product — it becomes one only when wrapped in infrastructure that handles load, failures, and cost.
Main production concerns
- Latency: how long one request takes from the user's perspective (end-to-end). Includes prompt assembly, model inference, and post-processing. Users typically expect <2 seconds for interactive chat.
- Throughput: how many requests you can handle concurrently (measured in requests/sec or tokens/sec).
- Availability: whether the service stays up. Outages in LLM providers can cascade to downstream products.
- Quality consistency: whether outputs remain useful across many users and over time as models or prompts change.
- Cost: prompt tokens, completion tokens, GPU time, and storage. A single GPT-4-class call can cost $0.01–$0.10, and at scale this adds up to thousands per day.
Self-hosted vs. API-based deployment
There are two broad approaches:
- API-based (managed): Use a provider like OpenAI, Anthropic, or Google. Simplest to start, but you give up control over latency, uptime, and data residency. Cost scales linearly with tokens.
- Self-hosted: Run open-source models (Llama, Mistral, Qwen) on your own GPUs or cloud instances. Higher upfront complexity, but full control over cost, latency, and data privacy.
Real-world example: a customer support assistant that feels magical at 20 users may become unusable at 2,000 users if each answer takes 15 seconds and every retry doubles cost.
Basic request lifecycle
User request
│
▼
API Gateway (auth, rate limiting)
│
▼
Prompt Assembly (template + context)
│
▼
Model Inference (prefill → decode tokens iteratively)
│
▼
Output Validation (safety filter, schema check)
│
▼
Response to User
│
▼
Logging & Metrics (latency, tokens, cost, feedback)
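The lifecycle above can be sketched as a minimal pipeline. Everything here is a stand-in — `assemble_prompt`, `fake_inference`, and `validate_output` are hypothetical placeholders for a real template engine, model call, and safety filter:

```python
import time

def assemble_prompt(user_msg: str, context: str) -> str:
    # Prompt assembly: template + injected context (hypothetical template)
    return f"Context:\n{context}\n\nUser: {user_msg}\nAssistant:"

def fake_inference(prompt: str) -> dict:
    # Stand-in for the real model call; returns text plus a usage report
    return {
        "text": "Sure, here is the answer.",
        "usage": {"prompt_tokens": len(prompt.split()), "completion_tokens": 6},
    }

def validate_output(text: str) -> bool:
    # Output validation placeholder: reject empty output
    return bool(text.strip())

def handle_request(user_msg: str, context: str) -> dict:
    start = time.perf_counter()
    prompt = assemble_prompt(user_msg, context)
    result = fake_inference(prompt)
    if not validate_output(result["text"]):
        result["text"] = "Sorry, I couldn't produce a safe answer."
    # Logging & metrics step: record end-to-end latency with the response
    result["latency_ms"] = (time.perf_counter() - start) * 1000
    return result
```

In a real service each stage would also emit metrics, and the gateway steps (auth, rate limiting) would run before `handle_request` is ever called.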
Key metrics to know
- Time to First Token (TTFT): how long until the user sees the first token stream in. Dominated by the prefill phase.
- Tokens per second (TPS): how fast the model generates output tokens during the decode phase.
- p50 / p95 / p99 latency: percentile latencies. p95 means 95% of requests finish within that time. Production SLAs are typically set on p95 or p99.
- Cost per request: total token cost (input + output) plus infrastructure cost amortized per request.
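A quick way to see how percentile latencies behave — a minimal nearest-rank percentile over a sample of request latencies (the helper is a sketch; production systems use a metrics library):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(0, min(k, len(s) - 1))]

latencies_ms = [120, 150, 180, 200, 250, 300, 400, 800, 1500, 3200]
print(percentile(latencies_ms, 50))  # 250 — typical request
print(percentile(latencies_ms, 95))  # 3200 — the tail an SLA must cover
print(percentile(latencies_ms, 99))  # 3200
```

Note how a single slow request drags p95 far above p50 — this is why SLAs set on averages hide tail pain.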
import time

def timed_call(fn, *args, **kwargs):
    """Wrap any function with timing instrumentation."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration_ms = (time.perf_counter() - start) * 1000
    print(f"Call took {duration_ms:.1f}ms")
    return result, duration_ms

# Example: track token usage alongside latency
def log_llm_call(response, duration_ms):
    usage = response.get("usage", {})
    print(f"  Prompt tokens: {usage.get('prompt_tokens', 0)}")
    print(f"  Completion tokens: {usage.get('completion_tokens', 0)}")
    print(f"  Latency: {duration_ms:.1f}ms")
    # Illustrative rates: $0.01 per 1K prompt tokens, $0.03 per 1K completion tokens
    cost = (usage.get('prompt_tokens', 0) * 0.00001
            + usage.get('completion_tokens', 0) * 0.00003)
    print(f"  Estimated cost: ${cost:.6f}")
Advanced
Engineering an LLM service means managing a distributed stochastic system. You need rate limits, adaptive routing, streaming, caching, retries, and fallback models. You also need a measurement framework that links technical metrics to business impact. At scale, the difference between naive and optimized serving can be 10–23x in throughput.
Why LLM inference is different from classical ML serving
Traditional ML models produce a fixed-size output in one forward pass. LLMs are autoregressive: they generate tokens one at a time, each depending on all prior tokens. This means:
- A single request may require hundreds of sequential forward passes (one per output token).
- The prefill phase processes the entire prompt in parallel (efficient GPU use), but the decode phase generates one token at a time per sequence (underutilizes the GPU).
- LLM inference is memory-bandwidth bound, not compute bound. Loading model weights from GPU memory to compute cores takes longer than the actual computation on those weights.
- GPU memory consumed scales with model size + sequence length. A 13B-parameter model needs roughly 1 MB of KV-cache state per token per sequence.
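The roughly-1-MB figure can be checked with back-of-envelope arithmetic, assuming a Llama-13B-like shape (40 layers, hidden size 5120, fp16 keys and values):

```python
# Per token, each layer stores one key vector and one value vector.
num_layers = 40      # Llama-13B-like shape (assumption)
hidden_size = 5120
bytes_per_elem = 2   # fp16

kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem  # 2 = key + value
print(kv_bytes_per_token / 1e6)  # ≈ 0.82 MB per token per sequence
```

At a 2,048-token context, that is about 1.7 GB of KV-cache per concurrent sequence — which is why KV-cache memory, not weights, often limits batch size.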
The most dangerous production failure is silent wrongness: outputs look polished but are incorrect. Instrument for correctness signals, not only uptime.
PagedAttention and KV-cache management
The KV-cache stores attention key/value tensors computed during prefill so they don't need to be recomputed on each decode step. Naive implementations pre-allocate memory for the maximum possible sequence length, wasting most of it. PagedAttention (introduced by vLLM) borrows ideas from OS virtual memory:
- KV-cache is divided into fixed-size pages (blocks) instead of contiguous buffers.
- Memory is allocated just-in-time as sequences grow, reducing waste to under 4%.
- Enables much larger effective batch sizes, translating directly to higher throughput.
- vLLM achieves up to 23x throughput over naive static batching by combining continuous batching with PagedAttention.
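A toy sketch of the paging idea (not vLLM's actual implementation): sequences are given fixed-size blocks on demand, so waste is bounded by at most one partially filled block per sequence.

```python
class PagedKVCache:
    """Toy page allocator: KV blocks are handed out just-in-time as sequences grow."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> int:
        """Record one decoded token; allocate a new block only when the last is full."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("cache full -> preempt or swap out a sequence")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return self.block_tables[seq_id][-1]  # physical block holding this token

    def free(self, seq_id: int):
        # Finished sequences return their blocks to the pool immediately
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

The block table per sequence plays the role of a page table: logically contiguous tokens map to scattered physical blocks, and freed blocks are reused by other sequences right away.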
Speculative decoding
A technique to speed up generation by using a small, fast draft model to propose several tokens at once, which the large target model then verifies in a single forward pass. Since verification of N tokens is nearly as fast as generating 1 token (parallel prefill), this can give 2–3x speedup with no quality loss.
- Draft model must be much smaller and faster (e.g., 1B parameters drafting for a 70B target).
- If the target model rejects a drafted token, generation falls back to the normal path from that point.
- Variants include EAGLE (using hidden-state drafting), Medusa (multiple prediction heads), and n-gram speculation.
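A simplified draft-and-verify loop over token ids, with `draft_fn` and `target_fn` as stand-ins for the two models:

```python
def speculative_step(prefix, draft_fn, target_fn, k=4):
    """One round: the draft proposes k tokens, the target verifies them.
    draft_fn/target_fn are stand-ins returning the next token for a prefix."""
    drafted = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_fn(ctx)
        drafted.append(t)
        ctx.append(t)
    # Target "verifies" all k positions (one batched forward pass in practice).
    accepted = []
    ctx = list(prefix)
    for t in drafted:
        if target_fn(ctx) == t:               # simplified greedy-match acceptance
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_fn(ctx))   # fall back to the target's own token
            break
    return accepted
```

Real implementations accept drafted tokens by comparing draft and target probabilities (rejection sampling) rather than exact greedy match, which is what preserves the target model's output distribution.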
Serving frameworks
When self-hosting, the choice of serving framework has a major impact on performance:
- vLLM: Open-source, high-throughput server with PagedAttention, continuous batching, speculative decoding, quantization support (GPTQ, AWQ, FP8), tensor/pipeline/expert parallelism, OpenAI-compatible API. The current industry standard for self-hosted serving.
- TGI (Text Generation Inference): Hugging Face's Rust+Python inference server with continuous batching, flash attention, quantization, and token streaming. Powers HF Inference Endpoints.
- TensorRT-LLM: NVIDIA's optimized inference library with FP8/INT4 kernels, inflight batching, and tight GPU integration. Best raw performance on NVIDIA hardware but less flexible.
- Ollama / llama.cpp: Lightweight CPU/GPU inference for local development and edge deployment. Uses GGUF quantized models.
Operational architecture patterns
- Async queues for long requests such as document analysis — decouple request acceptance from completion.
- Streaming tokens for interactive experiences where perceived latency matters. Users see tokens appear as they're generated.
- Semantic caching for repeated or similar questions. Hash normalized prompt + context to avoid redundant inference.
- Dynamic model routing: use a cheap/fast model first, escalate to a larger model only when the task is complex or the small model's confidence is low. Can reduce costs by 50–80%.
- Guardrails before and after inference: input toxicity filtering, PII detection, output schema validation (JSON mode), policy compliance checks.
- Multi-provider failover: if your primary model provider goes down, route to a secondary provider or a self-hosted fallback model.
Production LLM Architecture:
┌─────────────────────────────────────────────┐
│ Load Balancer │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ API Gateway │
│ (auth, rate limiting, request validation) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Router / Orchestrator │
│ (model selection, prompt assembly, cache) │
└───┬──────────────┬──────────────────┬───────┘
│ │ │
┌─────────▼───┐ ┌───────▼──────┐ ┌───────▼──────┐
│ Small Model │ │ Large Model │ │ Fallback │
│ (fast path) │ │ (accuracy) │ │ (static │
│ e.g. 7B │ │ e.g. 70B │ │ response) │
└─────────┬───┘ └───────┬──────┘ └───────┬──────┘
│ │ │
┌───▼──────────────▼──────────────────▼───────┐
│ Output Guardrails │
│ (safety filter, schema validation, PII) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Observability & Logging │
│ (latency, tokens, cost, traces, feedback) │
└─────────────────────────────────────────────┘
from dataclasses import dataclass

@dataclass
class RouteDecision:
    model_name: str
    reason: str
    estimated_cost: float

def choose_model(task_complexity: str, max_latency_ms: int,
                 input_tokens: int) -> RouteDecision:
    """Route requests to the cheapest model that meets requirements."""
    if task_complexity == "low" and max_latency_ms <= 2000:
        return RouteDecision(
            "small-chat-7b", "fast path",
            estimated_cost=input_tokens * 0.000001
        )
    if task_complexity == "medium":
        return RouteDecision(
            "mid-instruct-34b", "balanced path",
            estimated_cost=input_tokens * 0.000005
        )
    return RouteDecision(
        "large-reasoning-70b", "high accuracy path",
        estimated_cost=input_tokens * 0.00002
    )

# Example: simple queries take the fast path, complex ones escalate
print(choose_model("low", 1500, 500))   # → small-chat-7b
print(choose_model("high", 5000, 500))  # → large-reasoning-70b
Observability for LLM systems
Standard application monitoring (CPU, memory, HTTP codes) is necessary but not sufficient. LLM systems additionally need:
- Prompt version tracking: which template and version was used for each request.
- Retrieval context lineage: which documents or chunks were injected into the prompt (for RAG systems).
- Model and version: exact model ID, checkpoint, or API model version.
- Token accounting: input tokens, output tokens, and cost per request, aggregated by endpoint and user.
- Quality signals: validation pass/fail rates, user thumbs-up/down, automated eval scores.
- Drift detection: monitor output distribution changes that may indicate model degradation or prompt issues.
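These fields can be captured in one structured trace record per request — field names here are illustrative, not a standard schema:

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class LLMTrace:
    request_id: str
    prompt_version: str                 # e.g. "support_v3" — which template was used
    model_id: str                       # exact model ID or API model version
    retrieved_doc_ids: list             # context lineage for RAG requests
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    validation_passed: bool             # quality signal: did output checks pass
    user_feedback: Optional[str] = None # "up" / "down" / None
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Emitting one such record per request makes every line in the list above queryable: cost by endpoint, validation failure rates by prompt version, and drift over time.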
Cost optimization strategies
- Prompt compression: reduce input tokens by removing unnecessary context, using concise instructions, or applying techniques like LLMLingua.
- Caching: semantic caching (hash normalized prompt) can eliminate 20–40% of duplicate inference calls.
- Model cascading: route simple tasks to cheap models; only escalate complex queries.
- Quantization: FP8 or INT4 quantized models use less memory (enabling larger batch sizes) and run faster with minimal quality loss.
- Batching efficiency: continuous batching + PagedAttention can reduce per-token cost by maximizing GPU utilization.
- Max-token limits: set an explicit max_tokens to prevent runaway generation.
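A worked example of the cascading savings (prices are assumptions for illustration, not real provider rates):

```python
# Assumed prices per 1K tokens (input + output combined, for simplicity)
large_price = 0.03    # $/1K tokens, GPT-4-class model
small_price = 0.0005  # $/1K tokens, small self-hosted model
tokens_per_req = 1.0  # 1K tokens per request on average

n = 100_000  # requests per day
all_large = n * tokens_per_req * large_price
cascaded = n * tokens_per_req * (0.8 * small_price + 0.2 * large_price)
print(all_large)                   # $3000/day, everything on the large model
print(round(cascaded, 2))          # $640/day with an 80/20 cascade
print(round(1 - cascaded / all_large, 2))  # ~0.79 — roughly 79% saved
```

Even with generous assumptions for the small model's escalation rate, routing the easy 80% of traffic cheaply dominates the bill.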
Production checklist
- Set explicit timeouts and retry budgets (e.g., max 2 retries with exponential backoff).
- Log token counts and cost by endpoint, user, and model.
- Store redacted traces for debugging (strip PII from logged prompts).
- Evaluate with a frozen test set before any prompt or model upgrade.
- Design fallback responses when the model or retrieval system fails.
- Implement circuit breakers to stop calling a failing provider.
- Monitor for prompt injection attacks in user inputs.
- Set rate limits per user/API key to prevent abuse and cost overruns.
- Run load tests simulating realistic traffic patterns (variable input/output lengths, bursty arrival).
- Version-control all prompts alongside application code.
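The timeout-and-retry item from the checklist can be sketched as a generic wrapper — `call_with_retries` is a hypothetical helper, not a library API:

```python
import random
import time

def call_with_retries(fn, max_retries=2, base_delay=0.5, timeout_s=10.0):
    """Retry transient failures with exponential backoff and jitter.
    fn is any callable that raises on failure (e.g. a provider SDK call)."""
    deadline = time.monotonic() + timeout_s
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries or time.monotonic() >= deadline:
                raise  # retry budget or timeout exhausted: surface the error
            # Exponential backoff with jitter, capped so we never sleep past the deadline
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
```

A production version would retry only transient errors (timeouts, 429s, 5xx) and pair this with a circuit breaker so a hard-down provider isn't hammered until its budget runs out.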
import hashlib, json

class SemanticCache:
    """Simple prompt cache keyed by normalized content hash.
    Note: this is exact-match on normalized text; a true semantic cache
    would compare embeddings to also catch similar-but-not-identical prompts."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _make_key(self, prompt: str, model: str) -> str:
        normalized = prompt.strip().lower()
        raw = json.dumps({"prompt": normalized, "model": model},
                         sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt: str, model: str):
        key = self._make_key(prompt, model)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, prompt: str, model: str, response):
        key = self._make_key(prompt, model)
        self._store[key] = response

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
To-do list
Learn
- Understand latency metrics: TTFT, TPS, p50/p95/p99 latency, and rate limiting.
- Learn the difference between static batching, continuous batching, and chunked prefill.
- Study PagedAttention and why it enables larger batch sizes and higher throughput.
- Understand speculative decoding: draft models, verification, and when it helps.
- Study why LLM inference is memory-bandwidth bound, not compute bound.
- Learn why observability for LLM apps must include prompts, context lineage, and quality signals.
- Understand the trade-offs between self-hosted (vLLM, TGI) and API-based deployment.
- Learn the difference between transient failures (timeouts, rate limits) and systematic quality failures (model drift, bad prompt).
Practice
- Instrument a mock inference call with timing, token accounting, and cost estimation.
- Write a fallback plan for provider outage, schema failure, and timeout scenarios.
- Calculate the cost difference between routing 100% to GPT-4 vs. cascading 80% through a small model.
- Create an error taxonomy for LLM app logs (transient vs. systematic, user-caused vs. system-caused).
- Benchmark a small (7B) and large (70B) model on the same task and compare latency, quality, and cost.
- Sketch a continuous batching timeline for 5 requests with different completion lengths.
Build
- Create a FastAPI service that wraps an LLM call with timing, token counting, and structured logging.
- Add JSON schema validation, retry logic with exponential backoff, and a deterministic fallback message.
- Implement a prompt cache keyed by normalized prompt content hash.
- Build a model router that sends simple queries to a cheap model and complex queries to a large model.
- Deploy a model with vLLM or TGI and run a load test with varying input/output lengths.
- Build a monitoring dashboard or notebook summarizing latency distributions, token usage, cache hit rate, and failure rates.