Subject 08

Document chunking and indexing

Chunking decides how documents are split into searchable units. Indexing decides how those units are stored and retrieved. Small choices here can dominate overall RAG quality.

Beginner

Why chunking exists

A long document is usually too large and too mixed in topic to store as one searchable unit. Chunking breaks it into smaller pieces so the system can match a question to the part of the document that actually answers it.

The goal is not just to fit model limits. Good chunks also improve precision. A 20-page handbook embedded as one block produces a blurry representation of many topics at once, while smaller coherent chunks preserve one idea, section, or procedure at a time.

Core trade-offs

Good chunking target

One chunk = one self-contained idea
  -> understandable on its own
  -> small enough to retrieve precisely
  -> large enough to answer a question
  -> linked back to document and neighbors

Simple fixed-size chunking

Fixed-size chunking is the fastest baseline. It works well when documents are messy or inconsistent, but it can split through sentences or headings. In practice, many teams start with token-based or character-based chunking, then move to structure-aware rules if quality is weak.

def chunk_text(text, size=80, overlap=20):
    # Slide a fixed-size window with overlap so context at chunk
    # boundaries is not lost entirely.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

print(chunk_text("A" * 220, size=50, overlap=10))

Useful starting point: for long prose, a common first pass is roughly 300-600 tokens per chunk with 10-25% overlap, then adjust based on your documents and queries. Structured documents often need less overlap because headings already preserve context.
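The sizing guidance above can be sketched as a token-based variant of the fixed-size chunker. Whitespace splitting is used here only as a rough token proxy; a real pipeline should count tokens with the embedding model's own tokenizer. The 400-token target and 15% overlap are illustrative defaults, not recommendations for every corpus.

```python
def chunk_by_tokens(text, target_tokens=400, overlap_ratio=0.15):
    # Whitespace split as a rough token proxy; production systems
    # should use the model's actual tokenizer.
    tokens = text.split()
    overlap = int(target_tokens * overlap_ratio)
    step = target_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break
        start += step
    return chunks

parts = chunk_by_tokens("word " * 1000, target_tokens=400)
print(len(parts), len(parts[0].split()))
```

Because the step is `target_tokens - overlap`, the tail of each chunk reappears at the head of the next one, which is what preserves context across boundaries.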

What indexing means here

After chunking, each chunk becomes an indexable record. Indexing is the step where you assign identifiers, capture metadata, and store the chunk in a searchable form. Even a strong chunking policy becomes hard to use if the index cannot tell you which document a chunk came from, where it appeared, or which nearby chunks sit before and after it.

Document -> parsed sections -> chunks -> chunk records -> searchable index

Chunk record usually contains:
  chunk_id
  doc_id
  chunk_index
  section_title
  text
  source path or URL
  timestamps / tenant / permissions

Real-world example: a product specification document may mention battery life in one section and environmental operating range in another. Good chunking plus clean indexing helps the system retrieve the right section instead of the whole manual, and still lets you trace the answer back to its original source.

Advanced

Structure-aware and semantic chunking

Useful chunking is structure-aware. Instead of splitting blindly every N characters, prefer boundaries such as headings, paragraphs, tables, code blocks, list items, or speaker turns. For dense material, semantic chunking can go further by detecting topic shifts across sentences rather than relying only on formatting.
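A minimal sketch of structure-aware splitting for markdown-like input, grouping lines into sections at heading boundaries. This only illustrates the idea of preferring formatting boundaries over byte offsets; a real parser would also handle tables, code blocks, and nested heading levels.

```python
def split_by_headings(markdown_text):
    # Start a new section at each heading line; collect body lines under it.
    sections = []
    current = {"heading": None, "lines": []}
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("#"):
            if current["lines"] or current["heading"]:
                sections.append(current)
            current = {"heading": line.strip("# ").strip(), "lines": []}
        elif line.strip():
            current["lines"].append(line)
    sections.append(current)
    return [
        {"heading": s["heading"], "text": " ".join(s["lines"])}
        for s in sections
        if s["lines"]
    ]

doc = "# Intro\nFirst paragraph.\n# Setup\nInstall steps.\nMore steps.\n"
for s in split_by_headings(doc):
    print(s["heading"], "->", s["text"])
```

Each resulting section can then be passed through a size cap, so structure decides the boundaries and size limits only intervene when a section is too large.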

Common strategies

Fixed-size         -> fastest baseline, weakest boundaries
Recursive split    -> uses paragraphs/sentences first, size limits second
Structure-aware    -> follows headings, tables, code, speaker turns
Semantic           -> groups sentences by meaning shifts
Hierarchical       -> keeps parent/child chunks at multiple sizes
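The "Recursive split" row above can be sketched as follows: try paragraph boundaries first and fall back to a hard size cap only when a single paragraph is oversized. The 200-character cap is arbitrary and chosen only to keep the example small.

```python
def recursive_split(text, max_chars=200):
    # Prefer paragraph boundaries; hard-split only oversized paragraphs.
    if len(text) <= max_chars:
        return [text]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) > 1:
        chunks = []
        for p in paragraphs:
            chunks.extend(recursive_split(p, max_chars))
        return chunks
    # A single oversized paragraph: fall back to fixed-size slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

sample = "Short intro.\n\n" + "x" * 450 + "\n\nClosing note."
print([len(c) for c in recursive_split(sample)])
```

Note how the short paragraphs survive intact and only the 450-character run gets hard-split, which is exactly the "paragraphs first, size limits second" behavior in the table.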

Hierarchical chunking is especially useful for large manuals, textbooks, or policies. You may store a large parent section for context and smaller child chunks for precision. The index must then preserve those relationships so a system can retrieve the narrow child chunk and optionally expand to the parent when more context is needed.
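The parent-child relationship described above can be sketched as two lookups keyed by IDs stored at indexing time. The IDs and texts here are illustrative placeholders, not a real store's schema.

```python
# Parent sections hold broad context; child chunks hold precise answers.
parents = {
    "sec-4": "Full vacation policy section text covering accrual and carry-over.",
}
children = {
    "sec-4::0": {"text": "Accrual rate details.", "parent_id": "sec-4"},
    "sec-4::1": {"text": "Carry-over rules.", "parent_id": "sec-4"},
}

def retrieve(chunk_id, expand=False):
    # Return the narrow child chunk, or expand to its parent section
    # when the caller needs broader context.
    child = children[chunk_id]
    if expand:
        return parents[child["parent_id"]]
    return child["text"]

print(retrieve("sec-4::0"))
print(retrieve("sec-4::0", expand=True))
```

The key design point is that `parent_id` is written into the child record during indexing; expansion is then a cheap lookup rather than a second search.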

Metadata design

Rich metadata can be as important as the text itself. Metadata supports traceability, filtering, access control, reindexing, and chunk expansion.

Raw document
  -> parser
  -> structural segments
  -> chunk builder
  -> metadata enrichment
  -> indexing pipeline
  -> searchable corpus

Index record example

chunk_record = {
    "doc_id": "policy-2026-03",
    "chunk_id": "policy-2026-03::4",
    "section": "Vacation Accrual",
    "chunk_index": 4,
    "parent_section_id": "vacation-accrual",
    "text": "Full-time employees accrue 20 vacation days annually.",
    "source_url": "hr/policies/vacation",
    "created_at": "2026-03-08",
    "chunk_version": "v2"
}

print(chunk_record)

Failure modes to watch

  • Chunks that cut through a sentence, heading, or table row.
  • One chunk blending several topics into a blurry embedding.
  • Answers that routinely span two or three neighboring chunks.
  • Records that cannot be traced back to their source document.

Choosing a policy in practice

If documents are short and atomic
  -> do not chunk, index each document directly

If documents are long but well structured
  -> split by headings/paragraphs, then cap size

If documents are messy or OCR-heavy
  -> start with recursive or fixed-size chunking

If questions need both precision and broad context
  -> store child chunks plus parent sections

A minimal record builder for the well-structured case, assuming documents were already parsed into sections with headings and paragraphs:

def build_chunk_records(doc_id, title, sections):
    records = []
    for section_index, section in enumerate(sections):
        heading = section["heading"]
        paragraphs = section["paragraphs"]

        for chunk_index, paragraph in enumerate(paragraphs):
            records.append({
                "chunk_id": f"{doc_id}::{section_index}::{chunk_index}",
                "doc_id": doc_id,
                "title": title,
                "section_heading": heading,
                "chunk_index": chunk_index,
                "text": paragraph,
            })
    return records

Indexing should support downstream behavior, not just storage. If users filter by region, date, author, product line, or access tier, those fields must be attached during indexing rather than reconstructed later. Strong chunking and strong indexing work together: one shapes the text unit, the other preserves its identity and context.
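Metadata filtering of the kind described above can be sketched as a pre-filter that narrows the candidate set before any similarity search runs. The field names (`region`, `created_at`) are illustrative; the point is that they must already be on the record.

```python
records = [
    {"chunk_id": "a::0", "region": "EU", "created_at": "2026-01-10", "text": "..."},
    {"chunk_id": "b::0", "region": "US", "created_at": "2026-02-01", "text": "..."},
    {"chunk_id": "a::1", "region": "EU", "created_at": "2025-11-03", "text": "..."},
]

def prefilter(records, region=None, min_date=None):
    # Narrow the candidate set using metadata before any vector search.
    # ISO date strings compare correctly as plain strings.
    out = records
    if region is not None:
        out = [r for r in out if r["region"] == region]
    if min_date is not None:
        out = [r for r in out if r["created_at"] >= min_date]
    return out

hits = prefilter(records, region="EU", min_date="2026-01-01")
print([r["chunk_id"] for r in hits])
```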

Practical rule: inspect retrieved chunks manually. If a chunk does not make sense on its own, or if three neighboring chunks are almost always needed together, your chunk size, overlap, or boundary policy probably needs revision.
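A minimal inspection helper of the kind suggested above, assuming chunks are stored in document order so neighbors can be found by index:

```python
def with_neighbors(chunks, idx, window=1):
    # Show a retrieved chunk alongside its neighbors to judge whether
    # it stands on its own.
    lo = max(0, idx - window)
    hi = min(len(chunks), idx + window + 1)
    return chunks[lo:hi]

chunks = ["Step 1: unpack.", "Step 2: connect power.", "Step 3: pair device."]
print(with_neighbors(chunks, 1))
```

If reviewers keep needing the expanded view to understand a hit, that is a signal the boundary policy is splitting ideas that belong together.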

To-do list

Learn

  • Understand size, overlap, and structure-aware chunking trade-offs.
  • Learn when not to chunk and when to use hierarchical parent-child chunks.
  • Learn which metadata fields help retrieval and auditing.
  • Study how index record design affects provenance, filters, and reindexing.
  • Understand adjacency expansion and parent-child retrieval ideas.

Practice

  • Chunk the same document three different ways and compare retrieval quality.
  • Measure how many chunks are created at 256, 512, and 1024 token targets.
  • Test whether titles or headings improve results when added to chunk text.
  • Inspect cases where relevant information is split across chunks.
  • Track which metadata fields are actually used during debugging and audits.

Build

  • Create a chunking pipeline for PDFs, markdown files, or support logs.
  • Store rich metadata beside each chunk.
  • Add versioned indexing so document updates can invalidate old chunks safely.
  • Build a small tool to inspect retrieved chunks with neighboring context.
  • Write a short evaluation memo recommending one chunking policy.