Subject 02

LLM deployment and production issues

Moving from a notebook demo to a reliable product is mostly about operational discipline: latency, cost, observability, quality drift, and safe failure behavior. Production LLM systems are distributed, stochastic, and expensive — treat them as such.

Beginner

Deployment means making the model available to real users or internal systems. In production, you care about whether requests finish on time, whether answers are safe enough, and whether the monthly bill stays acceptable. A model that works in a Jupyter notebook is not a product — it becomes one only when wrapped in infrastructure that handles load, failures, and cost.

Main production concerns

  • Latency: how long users wait, especially for the first token.
  • Cost: spend scales with token volume, not just request count.
  • Reliability: timeouts, provider outages, rate limits, and safe failure behavior.
  • Quality and safety: hallucinations, drift, and harmful or malformed outputs.
  • Observability: knowing what the system actually did, and at what cost, for every request.

Self-hosted vs. API-based deployment

There are two broad approaches:

  • API-based: call a hosted provider over HTTPS. Fastest to ship and no GPUs to manage, but you pay per token, inherit the provider's rate limits and outages, and send data off-premises.
  • Self-hosted: run open-weight models on your own hardware (e.g. with vLLM or TGI). Full control over data, latency, and marginal cost at volume, but you own capacity planning, scaling, and upgrades.

Real-world example: a customer support assistant that feels magical at 20 users may become unusable at 2,000 users if each answer takes 15 seconds and every retry doubles cost.

Basic request lifecycle

User request
    │
    ▼
API Gateway (auth, rate limiting)
    │
    ▼
Prompt Assembly (template + context)
    │
    ▼
Model Inference (prefill → decode tokens iteratively)
    │
    ▼
Output Validation (safety filter, schema check)
    │
    ▼
Response to User
    │
    ▼
Logging & Metrics (latency, tokens, cost, feedback)
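
The stages above can be sketched as a pipeline of small functions. This is a toy sketch — `call_model` is a hypothetical stub standing in for real inference, and the validation rule is deliberately simplistic:

```python
import time

def assemble_prompt(template: str, context: str, question: str) -> str:
    """Prompt assembly: fill a template with retrieved context."""
    return template.format(context=context, question=question)

def call_model(prompt: str) -> str:
    """Hypothetical inference stub — a real system would call an LLM here."""
    return f"Answer based on {len(prompt)} prompt characters."

def validate_output(text: str) -> str:
    """Output validation: fall back to a safe message on empty/overlong output."""
    if not text or len(text) > 10_000:
        return "Sorry, I could not produce a reliable answer."
    return text

def handle_request(question: str, context: str) -> str:
    start = time.perf_counter()
    prompt = assemble_prompt("Context: {context}\nQ: {question}",
                             context, question)
    answer = validate_output(call_model(prompt))
    # Logging & metrics stage: record latency alongside the response
    print(f"latency_ms={(time.perf_counter() - start) * 1000:.1f}")
    return answer

print(handle_request("What is TTFT?", "TTFT = time to first token."))
```

In a real service each stage would also emit metrics, so a failure can be attributed to a specific stage rather than to "the model".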

Key metrics to know

  • TTFT (time to first token): how long before streaming starts — the perceived responsiveness.
  • TPS (tokens per second): decode speed once generation is underway.
  • p50/p95/p99 latency: median and tail latency — tails dominate user experience at scale.
  • Tokens per request and cost per request: the units your bill is denominated in.
  • Error and timeout rates, including rate-limit (429) responses.

import time

def timed_call(fn, *args, **kwargs):
    """Wrap any function with timing instrumentation."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration_ms = (time.perf_counter() - start) * 1000
    print(f"Call took {duration_ms:.1f}ms")
    return result, duration_ms

# Example: track token usage alongside latency
def log_llm_call(response, duration_ms):
    usage = response.get("usage", {})
    print(f"  Prompt tokens: {usage.get('prompt_tokens', 0)}")
    print(f"  Completion tokens: {usage.get('completion_tokens', 0)}")
    print(f"  Latency: {duration_ms:.1f}ms")
    # Illustrative per-token rates — substitute your provider's actual pricing
    cost = (usage.get('prompt_tokens', 0) * 0.00001
          + usage.get('completion_tokens', 0) * 0.00003)
    print(f"  Estimated cost: ${cost:.6f}")
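
Tail latency matters more than the mean for user experience, and p50/p95/p99 can be computed from raw samples with the standard library alone. A minimal sketch with simulated data:

```python
import random
import statistics

def latency_percentiles(samples_ms: list) -> dict:
    """Compute p50/p95/p99 from raw per-request latency samples."""
    # statistics.quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Simulate 1,000 requests: most fast, with a long tail of slow ones
random.seed(0)
samples = ([random.gauss(800, 150) for _ in range(950)]
           + [random.gauss(4000, 500) for _ in range(50)])
print(latency_percentiles(samples))
```

Note how a 5% slow tail barely moves p50 but dominates p99 — which is why dashboards that show only averages hide real user pain.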

Advanced

Engineering an LLM service means managing a distributed stochastic system. You need rate limits, adaptive routing, streaming, caching, retries, and fallback models. You also need a measurement framework that links technical metrics to business impact. At scale, the gap between naive and optimized serving can exceed an order of magnitude in throughput.

Why LLM inference is different from classical ML serving

Traditional ML models produce a fixed-size output in one forward pass. LLMs are autoregressive: they generate tokens one at a time, each depending on all prior tokens. This means:

  • Output length — and therefore latency and cost — is variable and unknown in advance.
  • Each in-flight request holds a growing KV-cache in GPU memory, so memory, not compute, usually limits batch size.
  • Requests in a batch finish at different times, which makes static batching wasteful and motivates continuous batching.
  • Latency splits into time-to-first-token (prefill) and per-token decode speed, which must be measured separately.
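
A consequence worth internalizing: end-to-end latency is roughly time-to-first-token (prefill) plus output length times per-token decode time. A back-of-the-envelope model — the numbers here are illustrative assumptions, not benchmarks:

```python
def estimate_latency_ms(ttft_ms: float, tokens_out: int,
                        ms_per_token: float) -> float:
    """Latency model for autoregressive decoding: prefill + sequential decode."""
    return ttft_ms + tokens_out * ms_per_token

# Illustrative numbers: 300 ms prefill, 20 ms per decoded token
short = estimate_latency_ms(300, 50, 20)        # 1,300 ms
long_answer = estimate_latency_ms(300, 500, 20) # 10,300 ms
print(f"50-token answer:  {short:.0f} ms")
print(f"500-token answer: {long_answer:.0f} ms")
```

The same prompt can cost 1 second or 10 seconds depending purely on how long the model decides to talk — which is why output caps and streaming matter.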

The most dangerous production failure is silent wrongness: outputs look polished but are incorrect. Instrument for correctness signals, not only uptime.
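
One cheap correctness signal for structured outputs is validating shape before trusting content. A minimal sketch (the required field names here are hypothetical):

```python
import json

def check_structured_output(raw: str, required_keys: set) -> tuple:
    """Return (ok, reason) — a correctness signal worth logging and alerting on."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid_json"
    if not isinstance(parsed, dict):
        return False, "not_an_object"
    missing = required_keys - parsed.keys()
    if missing:
        return False, f"missing_keys:{sorted(missing)}"
    return True, "ok"

# A polished-looking answer that drops a required field still gets flagged
print(check_structured_output('{"answer": "Paris"}',
                              {"answer", "confidence"}))
# → (False, "missing_keys:['confidence']")
print(check_structured_output('{"answer": "Paris", "confidence": 0.9}',
                              {"answer", "confidence"}))
# → (True, 'ok')
```

Shape checks catch only structural wrongness; semantic wrongness needs sampled human review, model-based grading, or downstream feedback loops.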

PagedAttention and KV-cache management

The KV-cache stores attention key/value tensors computed during prefill so they don't need to be recomputed on each decode step. Naive implementations pre-allocate memory for the maximum possible sequence length, wasting most of it. PagedAttention (introduced by vLLM) borrows ideas from OS virtual memory:

  • The cache is divided into small fixed-size blocks (pages) instead of one contiguous buffer per sequence.
  • Blocks are allocated on demand as a sequence grows and mapped through a per-sequence block table, so physical memory need not be contiguous.
  • Internal fragmentation is bounded by at most one partially filled block per sequence, freeing memory for larger batches.
  • Blocks can be shared between sequences with a common prefix (e.g. parallel sampling), with copy-on-write when they diverge.
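
The idea can be illustrated with a toy block allocator — a conceptual sketch only, not vLLM's actual implementation (no real tensors, no prefix sharing):

```python
class PagedKVAllocator:
    """Toy paged KV-cache: allocate fixed-size blocks on demand per sequence."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq id -> list of physical block ids
        self.token_counts = {}   # seq id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Grow a sequence by one token, grabbing a new block only when needed."""
        count = self.token_counts.get(seq_id, 0)
        if count % self.block_size == 0:   # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV-cache exhausted — request must wait")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.token_counts[seq_id] = count + 1

    def free(self, seq_id: str) -> None:
        """On completion, all of a sequence's blocks return to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):                 # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req-1")
print(len(alloc.block_tables["req-1"]), "blocks used")   # 3 blocks used
```

Waste is bounded by one partially filled block per sequence: the 40-token sequence occupies 3 blocks of 16 slots and wastes only 8 slots, versus hundreds wasted by max-length pre-allocation.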

Speculative decoding

A technique to speed up generation by using a small, fast draft model to propose several tokens at once, which the large target model then verifies in a single forward pass. Since verification of N tokens is nearly as fast as generating 1 token (parallel prefill), this can give 2–3x speedup with no quality loss.
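
The accept/verify logic can be shown with toy deterministic "models" that map a prefix to the next token. This sketch covers greedy decoding only; real systems verify all draft positions in one parallel forward pass and use rejection sampling to preserve the sampling distribution:

```python
def speculative_step(draft_fn, target_fn, prefix: list, k: int) -> list:
    """One round of speculative decoding: draft proposes k tokens, target verifies.

    Accepted = longest run where draft and target agree, plus one token from
    the target — so output always matches what the target alone would produce.
    """
    # 1. Cheap draft model proposes k tokens autoregressively
    proposed = []
    for _ in range(k):
        proposed.append(draft_fn(prefix + proposed))
    # 2. Target model checks each position (in practice: one parallel pass)
    accepted = []
    for i, tok in enumerate(proposed):
        expected = target_fn(prefix + proposed[:i])
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)   # take the target's token and stop
            break
    else:
        accepted.append(target_fn(prefix + proposed))  # bonus token: full accept
    return accepted

# Toy "models": the target continues a b c d e...; the draft agrees only on
# the first few positions, then starts guessing wrong.
ALPHA = "abcdefghij"
target = lambda prefix: ALPHA[len(prefix) % len(ALPHA)]
draft = lambda prefix: ALPHA[len(prefix) % len(ALPHA)] if len(prefix) < 4 else "x"

print(speculative_step(draft, target, prefix=["a"], k=4))
# → ['b', 'c', 'd', 'e']
```

Four tokens emerge from what would be a single target verification pass — the source of the speedup when the draft's acceptance rate is high.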

Serving frameworks

When self-hosting, the choice of serving framework has a major impact on performance:

  • vLLM — PagedAttention and continuous batching; a common default for high-throughput GPU serving of open-weight models.
  • Text Generation Inference (TGI) — Hugging Face's production server with continuous batching, tensor parallelism, and quantization support.
  • TensorRT-LLM — NVIDIA's compiled-kernel stack for maximum performance on NVIDIA GPUs, at the cost of a more involved build step.
  • llama.cpp / Ollama — CPU-friendly and quantization-focused; suited to local or edge deployment rather than large-scale serving.

Operational architecture patterns

Production LLM Architecture:

                    ┌─────────────────────────────────────────────┐
                    │              Load Balancer                  │
                    └──────────────────┬──────────────────────────┘
                                       │
                    ┌──────────────────▼──────────────────────────┐
                    │           API Gateway                       │
                    │  (auth, rate limiting, request validation)  │
                    └──────────────────┬──────────────────────────┘
                                       │
                    ┌──────────────────▼──────────────────────────┐
                    │           Router / Orchestrator             │
                    │  (model selection, prompt assembly, cache)  │
                    └───┬──────────────┬──────────────────┬───────┘
                        │              │                  │
              ┌─────────▼────┐  ┌──────▼───────┐  ┌───────▼──────┐
              │ Small Model  │  │ Large Model  │  │  Fallback    │
              │ (fast path)  │  │ (accuracy)   │  │  (static     │
              │ e.g. 7B      │  │ e.g. 70B     │  │   response)  │
              └─────────┬────┘  └──────┬───────┘  └───────┬──────┘
                        │              │                  │
                    ┌───▼──────────────▼──────────────────▼───────┐
                    │         Output Guardrails                   │
                    │  (safety filter, schema validation, PII)    │
                    └──────────────────┬──────────────────────────┘
                                       │
                    ┌──────────────────▼──────────────────────────┐
                    │       Observability & Logging               │
                    │  (latency, tokens, cost, traces, feedback)  │
                    └─────────────────────────────────────────────┘
				
from dataclasses import dataclass

@dataclass
class RouteDecision:
    model_name: str
    reason: str
    estimated_cost: float

def choose_model(task_complexity: str, max_latency_ms: int,
                 input_tokens: int) -> RouteDecision:
    """Route requests to the cheapest model that meets requirements."""
    if task_complexity == "low" and max_latency_ms <= 2000:
        return RouteDecision(
            "small-chat-7b", "fast path",
            estimated_cost=input_tokens * 0.000001
        )
    if task_complexity == "medium":
        return RouteDecision(
            "mid-instruct-34b", "balanced path",
            estimated_cost=input_tokens * 0.000005
        )
    return RouteDecision(
        "large-reasoning-70b", "high accuracy path",
        estimated_cost=input_tokens * 0.00002
    )

# Example: 80% of requests go to the cheap model
print(choose_model("low", 1500, 500))   # → small-chat-7b
print(choose_model("high", 5000, 500))  # → large-reasoning-70b

Observability for LLM systems

Standard application monitoring (CPU, memory, HTTP codes) is necessary but not sufficient. LLM systems additionally need:

  • Full prompt and response logging (with PII handling) so failures can be reproduced.
  • Token accounting per request — prompt, completion, and cached tokens — tied to cost.
  • Latency split into time-to-first-token and tokens per second, reported as p50/p95/p99.
  • Context lineage: which template version, retrieved documents, and model version produced each answer.
  • Quality signals: schema-validation failures, guardrail triggers, user feedback, and drift in output characteristics.
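
A per-request trace record tying these signals together might look like the following — all field names are illustrative, not a standard schema:

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class LLMTrace:
    """One structured log record per LLM call (illustrative fields)."""
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    model: str = ""
    prompt_template_version: str = ""
    retrieved_doc_ids: list = field(default_factory=list)
    prompt_tokens: int = 0
    completion_tokens: int = 0
    ttft_ms: float = 0.0
    total_ms: float = 0.0
    schema_valid: bool = True
    guardrail_triggered: bool = False
    user_feedback: Optional[str] = None   # filled in later if the user reacts

trace = LLMTrace(model="small-chat-7b", prompt_template_version="support-v12",
                 retrieved_doc_ids=["kb-491"], prompt_tokens=812,
                 completion_tokens=64, ttft_ms=310.0, total_ms=1590.0)
print(json.dumps(asdict(trace)))   # one JSON line per call for the log pipeline
```

Recording template version and retrieved document ids is what makes "which change broke quality last Tuesday?" answerable.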

Cost optimization strategies

  • Model cascading: send the bulk of traffic to a small model and escalate only hard queries.
  • Caching: reuse stored responses for repeated or near-duplicate prompts.
  • Prompt slimming: trim templates and retrieved context — input tokens often dominate spend.
  • Output caps: limit max tokens and use stop sequences so answers don't run long.
  • When self-hosting: quantization and continuous batching to raise throughput per GPU.
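
The leverage of cascading is easy to quantify. A back-of-the-envelope comparison, using made-up per-token prices in the same range as the router example above:

```python
def monthly_cost(requests: int, tokens_per_req: int,
                 price_per_token: float) -> float:
    """Total spend = volume x tokens x unit price."""
    return requests * tokens_per_req * price_per_token

REQS, TOKENS = 1_000_000, 1_000     # 1M requests/month, 1k tokens each
SMALL, LARGE = 0.000001, 0.00002    # illustrative $/token

all_large = monthly_cost(REQS, TOKENS, LARGE)
cascade = (monthly_cost(int(REQS * 0.8), TOKENS, SMALL)
           + monthly_cost(int(REQS * 0.2), TOKENS, LARGE))
print(f"100% large model: ${all_large:,.0f}/month")   # $20,000/month
print(f"80/20 cascade:    ${cascade:,.0f}/month")     # $4,800/month
```

Routing 80% of traffic to the small model cuts the bill by roughly 4x under these assumptions — before any caching gains.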

Production checklist

  • Timeouts, retries with exponential backoff, and a deterministic fallback for every model call.
  • Schema validation on structured outputs before they reach downstream systems.
  • Rate limiting and per-user or per-tenant token budgets.
  • Structured logging of latency, tokens, cost, and errors for every request.
  • A load test with realistic input/output length distributions before launch.

import hashlib, json

class SemanticCache:
    """Simple prompt cache keyed by normalized content hash."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _make_key(self, prompt: str, model: str) -> str:
        normalized = prompt.strip().lower()
        raw = json.dumps({"prompt": normalized, "model": model},
                         sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt: str, model: str):
        key = self._make_key(prompt, model)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, prompt: str, model: str, response):
        key = self._make_key(prompt, model)
        self._store[key] = response

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

To-do list

Learn

  • Understand latency metrics: TTFT, TPS, p50/p95/p99 latency, and rate limiting.
  • Learn the difference between static batching, continuous batching, and chunked prefill.
  • Study PagedAttention and why it enables larger batch sizes and higher throughput.
  • Understand speculative decoding: draft models, verification, and when it helps.
  • Study why LLM inference is memory-bandwidth bound, not compute bound.
  • Learn why observability for LLM apps must include prompts, context lineage, and quality signals.
  • Understand the trade-offs between self-hosted (vLLM, TGI) and API-based deployment.
  • Learn the difference between transient failures (timeouts, rate limits) and systematic quality failures (model drift, bad prompt).

Practice

  • Instrument a mock inference call with timing, token accounting, and cost estimation.
  • Write a fallback plan for provider outage, schema failure, and timeout scenarios.
  • Calculate the cost difference between routing 100% to GPT-4 vs. cascading 80% through a small model.
  • Create an error taxonomy for LLM app logs (transient vs. systematic, user-caused vs. system-caused).
  • Benchmark a small (7B) and large (70B) model on the same task and compare latency, quality, and cost.
  • Sketch a continuous batching timeline for 5 requests with different completion lengths.

Build

  • Create a FastAPI service that wraps an LLM call with timing, token counting, and structured logging.
  • Add JSON schema validation, retry logic with exponential backoff, and a deterministic fallback message.
  • Implement a prompt cache keyed by normalized prompt content hash.
  • Build a model router that sends simple queries to a cheap model and complex queries to a large model.
  • Deploy a model with vLLM or TGI and run a load test with varying input/output lengths.
  • Build a monitoring dashboard or notebook summarizing latency distributions, token usage, cache hit rate, and failure rates.