Beginner
Why chunking exists
A long document is usually too large and too mixed in topic to store as one searchable unit. Chunking breaks it into smaller pieces so the system can match a question to the part of the document that actually answers it.
The goal is not just to fit model limits. Good chunks also improve precision. A 20-page handbook embedded as one block produces a blurry representation of many topics at once, while smaller coherent chunks preserve one idea, section, or procedure at a time.
Core trade-offs
- Chunks that are too small lose context.
- Chunks that are too large dilute relevance and waste prompt tokens.
- Overlap helps preserve meaning when ideas span boundaries.
- Chunk boundaries should usually follow document structure before raw character counts.
- Already-short records such as FAQs, tickets, or product rows may not need chunking at all.
Good chunking target: one chunk = one self-contained idea -> understandable on its own -> small enough to retrieve precisely -> large enough to answer a question -> linked back to its document and neighbors
Simple fixed-size chunking
Fixed-size chunking is the fastest baseline. It works well when documents are messy or inconsistent, but it can split through sentences or headings. In practice, many teams start with token-based or character-based chunking, then move to structure-aware rules if quality is weak.
def chunk_text(text, size=80, overlap=20):
    # Slide a fixed-size window across the text, stepping by size - overlap
    # so consecutive chunks share `overlap` characters at each boundary.
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks

print(chunk_text("A" * 220, size=50, overlap=10))
Useful starting point: for long prose, a common first pass is roughly 300-600 tokens per chunk with 10-25% overlap, then adjust based on your documents and queries. Structured documents often need less overlap because headings already preserve context.
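The token-target guidance above can be sketched with a simple word-count splitter. This is a minimal illustration, not a production approach: it uses whitespace word count as a rough proxy for tokens, whereas a real pipeline would count with the embedding model's own tokenizer. The function name and parameters are hypothetical.

```python
def chunk_by_word_count(text, target_words=400, overlap_ratio=0.15):
    """Split text into overlapping chunks of roughly target_words words.

    Word count is a rough proxy for tokens; real pipelines should use
    the embedding model's tokenizer instead.
    """
    words = text.split()
    overlap = int(target_words * overlap_ratio)
    step = max(target_words - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + target_words]))
        if start + target_words >= len(words):
            break
    return chunks

doc = ("word " * 1000).strip()
parts = chunk_by_word_count(doc, target_words=400, overlap_ratio=0.15)
print(len(parts))  # a 1000-word document yields 3 overlapping chunks
```

With a 400-word target and 15% overlap, each step advances 340 words, so neighboring chunks share a 60-word boundary region.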
What indexing means here
After chunking, each chunk becomes an indexable record. Indexing is the step where you assign identifiers, capture metadata, and store the chunk in a searchable form. Even a strong chunking policy becomes hard to use if the index cannot tell you which document a chunk came from, where it appeared, or which nearby chunks sit before and after it.
Document -> parsed sections -> chunks -> chunk records -> searchable index

A chunk record usually contains: chunk_id, doc_id, chunk_index, section_title, text, source path or URL, and timestamps / tenant / permissions.
Real-world example: a product specification document may mention battery life in one section and environmental operating range in another. Good chunking plus clean indexing helps the system retrieve the right section instead of the whole manual, and still lets you trace the answer back to its original source.
Advanced
Structure-aware and semantic chunking
Useful chunking is structure-aware. Instead of splitting blindly every N characters, prefer boundaries such as headings, paragraphs, tables, code blocks, list items, or speaker turns. For dense material, semantic chunking can go further by detecting topic shifts across sentences rather than relying only on formatting.
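A minimal sketch of structure-aware splitting, assuming markdown-style input: it treats lines starting with `#` as section boundaries. A real parser would also handle tables, code fences, list items, and speaker turns, as noted above.

```python
import re

def split_by_headings(markdown_text):
    """Split markdown into (heading, body) sections at '#' heading lines.

    Minimal sketch; content before the first heading is labeled '(preamble)'.
    """
    sections = []
    heading, body = "(preamble)", []
    for line in markdown_text.splitlines():
        if re.match(r"#{1,6}\s", line):
            if body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if body:
        sections.append((heading, "\n".join(body).strip()))
    return sections

doc = "# Intro\nWelcome.\n## Setup\nInstall things.\n"
print(split_by_headings(doc))  # [('Intro', 'Welcome.'), ('Setup', 'Install things.')]
```

Each `(heading, body)` pair can then be size-capped with a fixed-size pass, which is the usual "structure first, size second" combination.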
Common strategies:
- Fixed-size -> fastest baseline, weakest boundaries
- Recursive split -> uses paragraphs/sentences first, size limits second
- Structure-aware -> follows headings, tables, code, speaker turns
- Semantic -> groups sentences by meaning shifts
- Hierarchical -> keeps parent/child chunks at multiple sizes
Hierarchical chunking is especially useful for large manuals, textbooks, or policies. You may store a large parent section for context and smaller child chunks for precision. The index must then preserve those relationships so a system can retrieve the narrow child chunk and optionally expand to the parent when more context is needed.
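The parent/child idea can be sketched as two record sets linked by a `parent_id`. The field and function names here are illustrative, and the input shape (`sections` as a list of heading/paragraphs dicts) is an assumption for the sketch.

```python
def build_hierarchy(doc_id, sections):
    """Store one parent record per section plus child records per paragraph.

    Sketch assuming sections = [{"heading": ..., "paragraphs": [...]}].
    """
    parents, children = {}, []
    for i, section in enumerate(sections):
        parent_id = f"{doc_id}::sec{i}"
        parents[parent_id] = {
            "heading": section["heading"],
            "text": "\n\n".join(section["paragraphs"]),
        }
        for j, para in enumerate(section["paragraphs"]):
            children.append({
                "chunk_id": f"{parent_id}::p{j}",
                "parent_id": parent_id,
                "text": para,
            })
    return parents, children

def expand_to_parent(child, parents):
    # Retrieve the narrow child chunk, then swap in the full section
    # text when the question needs broader context.
    return parents[child["parent_id"]]["text"]

sections = [{"heading": "Vacation Accrual",
             "paragraphs": ["Accrual rules.", "Carryover rules."]}]
parents, children = build_hierarchy("policy-2026-03", sections)
print(expand_to_parent(children[0], parents))
```

Retrieval matches the small child chunk; `expand_to_parent` then recovers the surrounding section, which is exactly the relationship the index must preserve.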
Metadata design
Rich metadata can be as important as the text itself. Metadata supports traceability, filtering, access control, reindexing, and chunk expansion.
- Store document ID, title, section heading, timestamp, tenant, and source URL.
- Track chunk order so adjacent expansion is possible.
- Keep raw text plus normalized text when preprocessing changes formatting.
- Store parser version or chunking policy version so you can rebuild the index safely later.
- Keep parent-child links when using hierarchical chunking.
- Consider a second indexing path for titles, headings, and exact entities.
Raw document -> parser -> structural segments -> chunk builder -> metadata enrichment -> indexing pipeline -> searchable corpus
Index record example
chunk_record = {
    "doc_id": "policy-2026-03",
    "chunk_id": "policy-2026-03::4",
    "section": "Vacation Accrual",
    "chunk_index": 4,
    "parent_section_id": "vacation-accrual",
    "text": "Full-time employees accrue 20 vacation days annually.",
    "source_url": "hr/policies/vacation",
    "created_at": "2026-03-08",
    "chunk_version": "v2",
}

print(chunk_record)
Failure modes to watch
- Boundary breaks: a definition starts in one chunk and its key qualifier lands in the next.
- Noisy chunks: one chunk mixes unrelated topics, so its representation becomes ambiguous.
- Lost context: references such as "it" or "this policy" become meaningless outside the parent section.
- Broken provenance: the system retrieves text but cannot point back to the exact source section.
- Stale indexes: documents change, but old chunks remain searchable because versioning is weak.
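The stale-index failure mode above suggests filtering by document version at reindex time. A minimal in-memory sketch, with hypothetical field names; a real index would delete stale records or filter on a version field at query time.

```python
def invalidate_stale(index, doc_id, current_version):
    """Drop chunks of doc_id whose doc_version is not current.

    In-memory sketch of version-based invalidation.
    """
    live = []
    for record in index:
        if record["doc_id"] == doc_id and record["doc_version"] != current_version:
            continue  # stale chunk from an older document version
        live.append(record)
    return live

index = [
    {"doc_id": "policy", "doc_version": "v1", "text": "old"},
    {"doc_id": "policy", "doc_version": "v2", "text": "new"},
    {"doc_id": "faq", "doc_version": "v1", "text": "kept"},
]
print(invalidate_stale(index, "policy", "v2"))
```

Storing a version on every chunk record (as in the index record example above) is what makes this kind of safe invalidation possible.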
Choosing a policy in practice
- If documents are short and atomic -> do not chunk, index each document directly
- If documents are long but well structured -> split by headings/paragraphs, then cap size
- If documents are messy or OCR-heavy -> start with recursive or fixed-size chunking
- If questions need both precision and broad context -> store child chunks plus parent sections
def build_chunk_records(doc_id, title, sections):
    records = []
    for section_index, section in enumerate(sections):
        heading = section["heading"]
        paragraphs = section["paragraphs"]
        for chunk_index, paragraph in enumerate(paragraphs):
            records.append({
                "chunk_id": f"{doc_id}::{section_index}::{chunk_index}",
                "doc_id": doc_id,
                "title": title,
                "section_heading": heading,
                "chunk_index": chunk_index,
                "text": paragraph,
            })
    return records
Indexing should support downstream behavior, not just storage. If users filter by region, date, author, product line, or access tier, those fields must be attached during indexing rather than reconstructed later. Strong chunking and strong indexing work together: one shapes the text unit, the other preserves its identity and context.
Practical rule: inspect retrieved chunks manually. If a chunk does not make sense on its own, or if three neighboring chunks are almost always needed together, your chunk size, overlap, or boundary policy probably needs revision.
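Inspecting a chunk with its neighbors can be sketched in a few lines, assuming records carry `doc_id` and `chunk_index` as in the record examples above. The function name and `window` parameter are illustrative.

```python
def with_neighbors(records, chunk_id, window=1):
    """Return a chunk together with its adjacent chunks from the same doc."""
    target = next(r for r in records if r["chunk_id"] == chunk_id)
    siblings = sorted(
        (r for r in records if r["doc_id"] == target["doc_id"]),
        key=lambda r: r["chunk_index"],
    )
    pos = siblings.index(target)
    lo = max(pos - window, 0)
    return siblings[lo:pos + window + 1]

records = [{"chunk_id": f"doc::{i}", "doc_id": "doc", "chunk_index": i}
           for i in range(5)]
print([r["chunk_id"] for r in with_neighbors(records, "doc::2")])
```

If this kind of manual inspection shows that the window almost always needs to be widened, that is a signal the chunk size or boundary policy is too aggressive.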
To-do list
Learn
- Understand size, overlap, and structure-aware chunking trade-offs.
- Learn when not to chunk and when to use hierarchical parent-child chunks.
- Learn which metadata fields help retrieval and auditing.
- Study how index record design affects provenance, filters, and reindexing.
- Understand adjacency expansion and parent-child retrieval ideas.
Practice
- Chunk the same document three different ways and compare retrieval quality.
- Measure how many chunks are created at 256, 512, and 1024 token targets.
- Test whether titles or headings improve results when added to chunk text.
- Inspect cases where relevant information is split across chunks.
- Track which metadata fields are actually used during debugging and audits.
Build
- Create a chunking pipeline for PDFs, markdown files, or support logs.
- Store rich metadata beside each chunk.
- Add versioned indexing so document updates can invalidate old chunks safely.
- Build a small tool to inspect retrieved chunks with neighboring context.
- Write a short evaluation memo recommending one chunking policy.