Subject 06

Vector databases

Vector databases are specialized data systems for storing high-dimensional vectors, indexing them for fast similarity search, and operating that search reliably at production scale. They combine nearest-neighbor algorithms with database features like filtering, updates, replication, and multi-tenant isolation.

Beginner

What problem does a vector database solve?

A vector database stores vectors and answers similarity questions such as "Which stored items are closest to this query vector?" The challenge is scale: comparing a query against every stored vector becomes too slow once you have hundreds of thousands or millions of records. Vector databases solve that by building indexes that avoid brute-force scans while still returning highly relevant neighbors.

Vector index vs. vector database

Term | Main job | Typical gap
Vector index | Accelerate nearest-neighbor search over vectors | Usually does not handle filtering, durability, replication, or access control alone
Vector database | Operate vector search as a full data system | More moving parts, configuration, and operational trade-offs

FAISS is a classic example of a vector index library. Systems such as Pinecone, Qdrant, Weaviate, Milvus, and pgvector-backed deployments add database capabilities around indexing so applications can manage vector data over time instead of treating search as a one-off algorithm.

Application record
    -> id
    -> vector
    -> metadata
    -> source pointer

Stored in vector database
    -> vector index for similarity search
    -> metadata store for filters
    -> storage/replication for persistence

Simple record structure

records = [
    {
        "id": "doc-001",
        "vector": [0.12, 0.55, -0.20],
        "metadata": {"tenant": "acme", "topic": "warranty", "year": 2025},
        "payload": {"title": "Warranty policy"}
    },
    {
        "id": "doc-002",
        "vector": [0.10, 0.51, -0.18],
        "metadata": {"tenant": "acme", "topic": "returns", "year": 2025},
        "payload": {"title": "Returns policy"}
    }
]

query_vector = [0.11, 0.52, -0.19]

# The database uses an index to find likely nearest neighbors quickly.
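At small scale, the baseline is exact brute-force search: score the query against every record and sort. A minimal sketch in plain Python (no vector database assumed), reusing the record shape above:

```python
import math

records = [
    {"id": "doc-001", "vector": [0.12, 0.55, -0.20]},
    {"id": "doc-002", "vector": [0.10, 0.51, -0.18]},
]
query_vector = [0.11, 0.52, -0.19]

def cosine(a, b):
    # Angle-based similarity: dot product divided by the
    # product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def brute_force_search(query, records, top_k=5):
    # O(n) per query: every stored vector is scored.
    scored = [(cosine(query, r["vector"]), r["id"]) for r in records]
    scored.sort(reverse=True)
    return scored[:top_k]

print(brute_force_search(query_vector, records))
```

This linear scan is exactly the cost an index is built to avoid once the collection grows.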

Intermediate

Similarity metrics

Nearest-neighbor search depends on how closeness is defined. The metric must match the assumptions of the embedding model and the index configuration.

Metric | Idea | When it is common
Cosine similarity | Compare angle between vectors | Text search with normalized embeddings
Dot product | Reward both alignment and magnitude | Models trained with inner-product objectives
Euclidean distance (L2) | Measure straight-line distance | Vision or geometric feature spaces

Important: A mismatch between model training objective and database metric can quietly hurt retrieval quality even when the index itself is fast.
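A small pure-Python sketch makes the mismatch concrete: cosine and dot product can rank the same two vectors differently once magnitudes differ (the vectors here are made up for illustration):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
a = [0.9, 0.1]   # well aligned with the query, small magnitude
b = [5.0, 3.0]   # less aligned, large magnitude

# Cosine prefers a (better angle); dot product prefers b (magnitude wins).
print(cosine(query, a), cosine(query, b))
print(dot(query, a), dot(query, b))
print(l2(query, a), l2(query, b))
```

For unit-normalized embeddings, cosine and dot product agree and L2 distance produces the same ranking as cosine, which is one reason many text pipelines normalize vectors at ingest.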

Why approximate nearest neighbor search is necessary

Exact search checks every vector, which is feasible for small datasets but expensive for large collections. Approximate nearest neighbor (ANN) methods search only promising regions of the space, trading a small amount of recall for major latency and throughput gains.

Index family | How it works | Main trade-off
HNSW | Navigable graph of neighbors across multiple layers | Excellent recall/latency, but memory-heavy
IVF | Cluster vectors, then search only selected clusters | Fast and scalable, but needs good partitioning
IVF-PQ / PQ | Compress vectors into short codes | Lower memory usage, some precision loss
LSH | Hash similar vectors into the same buckets | Very fast for some workloads, less common in modern text search

The key operating knobs differ by index. For HNSW, search breadth affects recall and latency. For IVF, the number of coarse clusters and probes matters. For PQ, codebook size and compression ratio matter. Good teams benchmark these choices rather than assuming one default fits all datasets.
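To make the IVF intuition concrete, here is a toy sketch with hand-picked centroids (a real system would learn them with k-means); `nprobe` is the coarse-cluster knob mentioned above:

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy IVF: vectors are assigned to their nearest coarse centroid at
# build time; a query scans only the nprobe closest clusters.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]  # illustrative, not learned

def build_ivf(vectors):
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        lists[nearest].append((vid, vec))
    return lists

def ivf_search(lists, query, top_k=2, nprobe=1):
    # Raising nprobe scans more clusters: better recall, higher latency.
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [item for i in probe for item in lists[i]]
    candidates.sort(key=lambda item: l2(query, item[1]))
    return [vid for vid, _ in candidates[:top_k]]

vectors = {"a": [0.1, 0.2], "b": [9.8, 9.9], "c": [0.3, 9.7], "d": [0.2, 0.1]}
lists = build_ivf(vectors)
print(ivf_search(lists, [0.0, 0.1], top_k=2, nprobe=1))
```

With `nprobe=1` the query never touches the other clusters, which is where IVF's recall risk comes from: a true neighbor assigned to an unprobed cluster is simply never seen.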

Filtering and data layout

Vector search almost never runs on vectors alone. Real applications need filters such as tenant separation, language, content type, freshness windows, or authorization tags. That means the database must coordinate vector search with structured filtering.

Concern | Why it matters | Typical lever
Recall | Missing close neighbors makes the search system unreliable | Index tuning, candidate expansion, exact re-check
Latency | Search must fit interactive SLAs | ANN parameters, caching, shard layout
Filtering | Users often need tenant, security, or time scoping | Metadata schema, pre-filter vs post-filter strategy
Updates | Knowledge changes over time | Upsert, background compaction, freshness layer

Example upsert and filtered search requests:

upsert_request = {
    "points": [
        {
            "id": "faq-17",
            "vector": [0.32, -0.14, 0.88],
            "metadata": {"tenant": "acme", "region": "us", "status": "published"}
        }
    ]
}

search_request = {
    "vector": [0.30, -0.10, 0.90],
    "top_k": 5,
    "filter": {"tenant": "acme", "status": "published"},
    "include_metadata": True
}

Write path
    application -> validate schema -> store vector + metadata -> update index

Read path
    query vector -> apply filters -> ANN candidate search -> score/rerank -> return top-k
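The pre-filter vs post-filter choice in the read path has a concrete failure mode: if filtering happens after ANN search, a restrictive filter can leave fewer than top_k survivors. A toy sketch (the candidate list stands in for ANN output, nearest first):

```python
candidates = [
    {"id": "v1", "tenant": "acme"},
    {"id": "v2", "tenant": "globex"},
    {"id": "v3", "tenant": "globex"},
    {"id": "v4", "tenant": "acme"},
    {"id": "v5", "tenant": "globex"},
]

def post_filter(candidates, tenant, top_k):
    # Filter applied after ANN search: if most neighbors belong to
    # other tenants, fewer than top_k results survive.
    kept = [c for c in candidates if c["tenant"] == tenant]
    return kept[:top_k]

print(post_filter(candidates, "acme", 3))  # asked for 3, only 2 survive
```

Real systems counter this by pre-filtering (restricting the candidate set before search), over-fetching candidates, or maintaining per-tenant namespaces or indexes.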

Advanced

Production architecture concerns

A serious vector database is not just an index in memory. It must keep data durable, searchable after restarts, and responsive under changing traffic. That introduces classic distributed systems concerns on top of ANN search.

What to measure

Benchmarking vector databases requires both search quality and systems metrics. Fast answers are not useful if the nearest neighbors are wrong, and accurate answers are not useful if p95 latency breaks your SLA.

Metric | What it tells you | Common failure signal
Recall@k | Whether ANN search is finding the right neighbors | Relevant items disappear when index parameters are tightened
p95 / p99 latency | Tail responsiveness under realistic traffic | Queries occasionally spike far above average
Write freshness | How long new vectors take to become searchable | Recent updates cannot be found for seconds or minutes
Filter selectivity | How restrictive metadata filters are | Query cost jumps when filters are broad or highly skewed
Memory per million vectors | Infrastructure efficiency of the chosen index | HNSW or uncompressed storage becomes too expensive
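Recall@k is straightforward to compute once you have brute-force ground truth for a sampled query set; a minimal sketch with made-up result lists:

```python
def recall_at_k(ann_results, exact_results, k):
    # Fraction of the true top-k neighbors that ANN search returned.
    ann = set(ann_results[:k])
    exact = set(exact_results[:k])
    return len(ann & exact) / k

exact = ["d1", "d2", "d3", "d4", "d5"]   # ground truth from brute force
ann   = ["d1", "d3", "d2", "d9", "d6"]   # what the index returned

print(recall_at_k(ann, exact, 5))  # 3 of the 5 true neighbors found -> 0.6
```

In practice this is averaged over hundreds of sampled queries and tracked alongside p95 latency, since tuning that raises one often degrades the other.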

Common failure modes

  • Recall collapses after index parameters are tightened for speed.
  • Tail latency (p95/p99) spikes far above the average under realistic load.
  • Newly written vectors stay unsearchable for seconds or minutes.
  • Broad or highly skewed metadata filters make query cost jump.
  • An uncompressed or HNSW index outgrows the memory budget as data grows.

Selection checklist

If your workload needs:
    highest recall with enough RAM -> HNSW is often a strong default
    lower memory footprint at larger scale -> consider IVF/PQ variants
    strict tenant isolation -> prioritize namespaces, ACLs, and filter performance
    frequent writes -> verify upsert cost and freshness guarantees
    low-ops deployment -> managed/serverless offerings may matter more than raw ANN speed
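Back-of-envelope memory arithmetic often decides between these options. The per-vector figures below are illustrative assumptions (float32 storage, roughly 32 graph links per vector, 64-byte PQ codes), not measurements of any particular database:

```python
# Rough memory estimate per million vectors; the per-vector figures
# are illustrative assumptions, not vendor measurements.
n = 1_000_000
dims = 768

flat_bytes = n * dims * 4        # float32, uncompressed
hnsw_link_bytes = n * 32 * 8     # assume ~32 links/vector, 8 bytes each
pq_bytes = n * 64                # assume 64-byte PQ codes per vector

print(f"flat float32: {flat_bytes / 1e9:.1f} GB")
print(f"HNSW (vectors + links): {(flat_bytes + hnsw_link_bytes) / 1e9:.1f} GB")
print(f"PQ codes only: {pq_bytes / 1e9:.3f} GB")
```

Even this crude estimate shows the order-of-magnitude gap (roughly 3 GB uncompressed versus tens of MB for PQ codes) that makes compression attractive at larger scale.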

Exam framing: Vector databases are best understood as ANN search systems plus database operations. The important trade-off is not just speed versus quality, but speed versus quality versus memory versus operational complexity.

To-do list

Learn

  • Understand the difference between a vector index and a full vector database.
  • Learn when cosine similarity, dot product, and L2 distance are appropriate.
  • Study HNSW, IVF, and PQ at the intuition level and know their main trade-offs.
  • Learn why filtering, durability, replication, and freshness matter in production.

Practice

  • Load a small collection into a local vector database and test multiple similarity metrics.
  • Benchmark exact search against ANN search on a sampled evaluation set.
  • Measure the effect of metadata filters on latency and returned candidates.
  • Compare memory usage for an HNSW-style setup versus a compressed index setup.

Build

  • Build a similarity search service with CRUD support for vector records.
  • Add namespaces or tenant IDs and verify isolation in queries.
  • Create a benchmark script that tracks recall@k and p95 latency together.
  • Design a schema for metadata filters that would hold up under production growth.