Subject 08

Document chunking and indexing

Chunking decides how documents are split into searchable units. Indexing decides how those units are stored and retrieved. Small choices here can dominate overall RAG quality.

Beginner

Why chunking exists

A long document is usually too large and too mixed in topic to store as one searchable unit. Chunking breaks it into smaller pieces so the system can match a question to the part of the document that actually answers it.

The goal is not just to fit model limits. Good chunks also improve precision. A 20-page handbook embedded as one block produces a blurry representation of many topics at once, while smaller coherent chunks preserve one idea, section, or procedure at a time.

Core trade-offs

Good chunking target

One chunk = one self-contained idea
  -> understandable on its own
  -> small enough to retrieve precisely
  -> large enough to answer a question
  -> linked back to document and neighbors

Simple fixed-size chunking

Fixed-size chunking is the fastest baseline. It works well when documents are messy or inconsistent, but it can split through sentences or headings. In practice, many teams start with token-based or character-based chunking, then move to structure-aware rules if quality is weak.

def chunk_text(text, size=80, overlap=20):
    # Slide a fixed-size window with overlap so context at chunk
    # boundaries is not lost entirely.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

print(chunk_text("A" * 220, size=50, overlap=10))

Useful starting point: for long prose, a common first pass is roughly 300-600 tokens per chunk with 10-25% overlap, then adjust based on your documents and queries. Structured documents often need less overlap because headings already preserve context.
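The sizing guidance above can be sketched as a token-based variant of the fixed-size chunker. Whitespace splitting is used here only as a rough token proxy; a real pipeline should count tokens with the embedding model's own tokenizer. The 400-token target and 15% overlap are illustrative defaults, not recommendations for every corpus.

```python
def chunk_by_tokens(text, target_tokens=400, overlap_ratio=0.15):
    # Whitespace split as a rough token proxy; production systems
    # should use the model's actual tokenizer.
    tokens = text.split()
    overlap = int(target_tokens * overlap_ratio)
    step = target_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break
        start += step
    return chunks

parts = chunk_by_tokens("word " * 1000, target_tokens=400)
print(len(parts), len(parts[0].split()))
```

Because the step is `target_tokens - overlap`, the tail of each chunk reappears at the head of the next one, which is what preserves context across boundaries.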

What indexing means here

After chunking, each chunk becomes an indexable record. Indexing is the step where you assign identifiers, capture metadata, and store the chunk in a searchable form. Even a strong chunking policy becomes hard to use if the index cannot tell you which document a chunk came from, where it appeared, or which nearby chunks sit before and after it.

Document -> parsed sections -> chunks -> chunk records -> searchable index

Chunk record usually contains:
  chunk_id
  doc_id
  chunk_index
  section_title
  text
  source path or URL
  timestamps / tenant / permissions

Real-world example: a product specification document may mention battery life in one section and environmental operating range in another. Good chunking plus clean indexing helps the system retrieve the right section instead of the whole manual, and still lets you trace the answer back to its original source.

Advanced

Structure-aware and semantic chunking

Useful chunking is structure-aware. Instead of splitting blindly every N characters, prefer boundaries such as headings, paragraphs, tables, code blocks, list items, or speaker turns. For dense material, semantic chunking can go further by detecting topic shifts across sentences rather than relying only on formatting.
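A minimal sketch of structure-aware splitting for markdown-like input, grouping lines into sections at heading boundaries. This only illustrates the idea of preferring formatting boundaries over byte offsets; a real parser would also handle tables, code blocks, and nested heading levels.

```python
def split_by_headings(markdown_text):
    # Start a new section at each heading line; collect body lines under it.
    sections = []
    current = {"heading": None, "lines": []}
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("#"):
            if current["lines"] or current["heading"]:
                sections.append(current)
            current = {"heading": line.strip("# ").strip(), "lines": []}
        elif line.strip():
            current["lines"].append(line)
    sections.append(current)
    return [
        {"heading": s["heading"], "text": " ".join(s["lines"])}
        for s in sections
        if s["lines"]
    ]

doc = "# Intro\nFirst paragraph.\n# Setup\nInstall steps.\nMore steps.\n"
for s in split_by_headings(doc):
    print(s["heading"], "->", s["text"])
```

Each resulting section can then be passed through a size cap, so structure decides the boundaries and size limits only intervene when a section is too large.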

Common strategies

Fixed-size         -> fastest baseline, weakest boundaries
Recursive split    -> uses paragraphs/sentences first, size limits second
Structure-aware    -> follows headings, tables, code, speaker turns
Semantic           -> groups sentences by meaning shifts
Hierarchical       -> keeps parent/child chunks at multiple sizes
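The "Recursive split" row above can be sketched as follows: try paragraph boundaries first and fall back to a hard size cap only when a single paragraph is oversized. The 200-character cap is arbitrary and chosen only to keep the example small.

```python
def recursive_split(text, max_chars=200):
    # Prefer paragraph boundaries; hard-split only oversized paragraphs.
    if len(text) <= max_chars:
        return [text]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) > 1:
        chunks = []
        for p in paragraphs:
            chunks.extend(recursive_split(p, max_chars))
        return chunks
    # A single oversized paragraph: fall back to fixed-size slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

sample = "Short intro.\n\n" + "x" * 450 + "\n\nClosing note."
print([len(c) for c in recursive_split(sample)])
```

Note how the short paragraphs survive intact and only the 450-character run gets hard-split, which is exactly the "paragraphs first, size limits second" behavior in the table.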

Hierarchical chunking is especially useful for large manuals, textbooks, or policies. You may store a large parent section for context and smaller child chunks for precision. The index must then preserve those relationships so a system can retrieve the narrow child chunk and optionally expand to the parent when more context is needed.
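The parent-child relationship described above can be sketched as two lookups keyed by IDs stored at indexing time. The IDs and texts here are illustrative placeholders, not a real store's schema.

```python
# Parent sections hold broad context; child chunks hold precise answers.
parents = {
    "sec-4": "Full vacation policy section text covering accrual and carry-over.",
}
children = {
    "sec-4::0": {"text": "Accrual rate details.", "parent_id": "sec-4"},
    "sec-4::1": {"text": "Carry-over rules.", "parent_id": "sec-4"},
}

def retrieve(chunk_id, expand=False):
    # Return the narrow child chunk, or expand to its parent section
    # when the caller needs broader context.
    child = children[chunk_id]
    if expand:
        return parents[child["parent_id"]]
    return child["text"]

print(retrieve("sec-4::0"))
print(retrieve("sec-4::0", expand=True))
```

The key design point is that `parent_id` is written into the child record during indexing; expansion is then a cheap lookup rather than a second search.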

Metadata design

Rich metadata can be as important as the text itself. Metadata supports traceability, filtering, access control, reindexing, and chunk expansion.

Raw document
  -> parser
  -> structural segments
  -> chunk builder
  -> metadata enrichment
  -> indexing pipeline
  -> searchable corpus

Index record example

chunk_record = {
    "doc_id": "policy-2026-03",
    "chunk_id": "policy-2026-03::4",
    "section": "Vacation Accrual",
    "chunk_index": 4,
    "parent_section_id": "vacation-accrual",
    "text": "Full-time employees accrue 20 vacation days annually.",
    "source_url": "hr/policies/vacation",
    "created_at": "2026-03-08",
    "chunk_version": "v2"
}

print(chunk_record)

Failure modes to watch

  • Chunks that cut through a sentence, heading, or table row.
  • One chunk blending several topics into a blurry embedding.
  • Answers that routinely span two or three neighboring chunks.
  • Records that cannot be traced back to their source document.

Choosing a policy in practice

If documents are short and atomic
  -> do not chunk, index each document directly

If documents are long but well structured
  -> split by headings/paragraphs, then cap size

If documents are messy or OCR-heavy
  -> start with recursive or fixed-size chunking

If questions need both precision and broad context
  -> store child chunks plus parent sections

A minimal record builder for the well-structured case, assuming documents were already parsed into sections with headings and paragraphs:

def build_chunk_records(doc_id, title, sections):
    records = []
    for section_index, section in enumerate(sections):
        heading = section["heading"]
        paragraphs = section["paragraphs"]

        for chunk_index, paragraph in enumerate(paragraphs):
            records.append({
                "chunk_id": f"{doc_id}::{section_index}::{chunk_index}",
                "doc_id": doc_id,
                "title": title,
                "section_heading": heading,
                "chunk_index": chunk_index,
                "text": paragraph,
            })
    return records

Indexing should support downstream behavior, not just storage. If users filter by region, date, author, product line, or access tier, those fields must be attached during indexing rather than reconstructed later. Strong chunking and strong indexing work together: one shapes the text unit, the other preserves its identity and context.
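Metadata filtering of the kind described above can be sketched as a pre-filter that narrows the candidate set before any similarity search runs. The field names (`region`, `created_at`) are illustrative; the point is that they must already be on the record.

```python
records = [
    {"chunk_id": "a::0", "region": "EU", "created_at": "2026-01-10", "text": "..."},
    {"chunk_id": "b::0", "region": "US", "created_at": "2026-02-01", "text": "..."},
    {"chunk_id": "a::1", "region": "EU", "created_at": "2025-11-03", "text": "..."},
]

def prefilter(records, region=None, min_date=None):
    # Narrow the candidate set using metadata before any vector search.
    # ISO date strings compare correctly as plain strings.
    out = records
    if region is not None:
        out = [r for r in out if r["region"] == region]
    if min_date is not None:
        out = [r for r in out if r["created_at"] >= min_date]
    return out

hits = prefilter(records, region="EU", min_date="2026-01-01")
print([r["chunk_id"] for r in hits])
```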

Practical rule: inspect retrieved chunks manually. If a chunk does not make sense on its own, or if three neighboring chunks are almost always needed together, your chunk size, overlap, or boundary policy probably needs revision.
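A minimal inspection helper of the kind suggested above, assuming chunks are stored in document order so neighbors can be found by index:

```python
def with_neighbors(chunks, idx, window=1):
    # Show a retrieved chunk alongside its neighbors to judge whether
    # it stands on its own.
    lo = max(0, idx - window)
    hi = min(len(chunks), idx + window + 1)
    return chunks[lo:hi]

chunks = ["Step 1: unpack.", "Step 2: connect power.", "Step 3: pair device."]
print(with_neighbors(chunks, 1))
```

If reviewers keep needing the expanded view to understand a hit, that is a signal the boundary policy is splitting ideas that belong together.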

To-do list

Learn

  • Understand size, overlap, and structure-aware chunking trade-offs.
  • Learn when not to chunk and when to use hierarchical parent-child chunks.
  • Learn which metadata fields help retrieval and auditing.
  • Study how index record design affects provenance, filters, and reindexing.
  • Understand adjacency expansion and parent-child retrieval ideas.

Practice

  • Chunk the same document three different ways and compare retrieval quality.
  • Measure how many chunks are created at 256, 512, and 1024 token targets.
  • Test whether titles or headings improve results when added to chunk text.
  • Inspect cases where relevant information is split across chunks.
  • Track which metadata fields are actually used during debugging and audits.

Build

  • Create a chunking pipeline for PDFs, markdown files, or support logs.
  • Store rich metadata beside each chunk.
  • Add versioned indexing so document updates can invalidate old chunks safely.
  • Build a small tool to inspect retrieved chunks with neighboring context.
  • Write a short evaluation memo recommending one chunking policy.