Subject 17

Tokenization and subword methods

Tokenization converts raw text into discrete units before the model ever sees an ID, embedding, or attention pattern. Subword methods matter because real language is open-ended: names, misspellings, compounds, code, and multilingual text would make a pure word vocabulary too large and too fragile.

Beginner

Why tokenization exists

Models do not consume raw strings directly. They consume token IDs. Tokenization is the step that decides how text such as unbelievable, SKU-8472A, or こんにちは gets broken into units that can be mapped to a fixed vocabulary. That makes tokenization one of the earliest and most important design choices in any NLP pipeline.

Raw text -> tokenizer -> token pieces -> token IDs -> model

"unbelievable" might become:
["un", "believ", "able"]

"SKU-8472A" might become:
["SKU", "-", "847", "2", "A"]

What problem subwords solve

If every distinct word needed its own entry, vocabularies would explode in size because of inflections, derivations, typos, usernames, URLs, and domain-specific identifiers. Subwords let the system reuse pieces it already knows.

examples = {
    "unbelievable": ["un", "believ", "able"],
    "rerunning": ["re", "run", "ning"],
    "bioinformatics": ["bio", "inform", "atics"]
}

for word, pieces in examples.items():
    print(word, "->", pieces)

Real-world example: user-generated text often contains unseen forms such as product codes, slang, and spelling variants. Subword units help the model process these inputs instead of collapsing them into a generic unknown token.

Simple comparison

Approach  | Main advantage                               | Main drawback
Word      | Short sequences and intuitive tokens         | Huge vocabulary and poor handling of unseen words
Character | Full coverage of any input string            | Very long sequences and weaker local meaning per token
Subword   | Good balance between coverage and efficiency | Quality depends heavily on tokenizer design and training data

Practical intuition: tokenization is not just preprocessing. It affects how much of the context window you consume, whether rare strings survive intact enough to be useful, and how stable your pipeline is across domains.
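For example, a rough token-budget check might look like the sketch below. The whitespace "tokenizer" is a crude stand-in purely for illustration; real subword tokenizers usually produce more tokens than there are words.

```python
def fits_budget(tokenize, text, max_tokens):
    # True when the tokenized text fits inside the context budget.
    return len(tokenize(text)) <= max_tokens

whitespace_tokenize = str.split  # stand-in tokenizer, not a real one
print(fits_budget(whitespace_tokenize, "a short test prompt", 8))  # True
print(fits_budget(whitespace_tokenize, "a short test prompt", 2))  # False
```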

Advanced

Core algorithm families

Popular methods include Byte Pair Encoding (BPE), WordPiece, unigram language-model tokenization, and SentencePiece-style training on raw text. They all try to learn a compact vocabulary that keeps common pieces short while still allowing rare forms to be decomposed productively.

How the training logic differs

Method         | Training intuition                                                         | Common strength                                        | Common tradeoff
BPE            | Merge the most frequent adjacent symbol pair, again and again             | Simple and widely used                                 | Greedy merges may not be globally optimal
WordPiece      | Prefer merges that are especially informative under the corpus statistics | Often produces useful compact pieces                   | Still depends strongly on training corpus composition
Unigram        | Start large, then prune candidate pieces by likelihood impact             | Flexible segmentation and sampling variants            | More probabilistic and less intuitive to reason about
Byte-level BPE | Use bytes as the base alphabet before learning merges                     | Can represent arbitrary text without an unknown token  | May inflate token counts for some inputs

The overall training pipeline:

Corpus
  -> collect frequencies or likelihood statistics
  -> choose vocabulary size
  -> learn merge rules or keep/drop candidate pieces
  -> tokenize new text into pieces
  -> map pieces to IDs
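The last two steps, tokenizing new text into pieces, are often implemented as a greedy longest-match walk over the learned vocabulary, roughly in the spirit of WordPiece inference. A minimal sketch with an invented vocabulary (it assumes every character is covered; real tokenizers add an explicit fallback):

```python
vocab = {"low", "er", "new", "est", "l", "o", "w", "e", "s", "t"}

def greedy_tokenize(word):
    # Repeatedly take the longest vocabulary piece that prefixes the rest.
    pieces = []
    while word:
        match = max((p for p in vocab if word.startswith(p)), key=len)
        pieces.append(match)
        word = word[len(match):]
    return pieces

print(greedy_tokenize("lowest"))  # -> ['low', 'est']
print(greedy_tokenize("newer"))   # -> ['new', 'er']
```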

Tokenizer quality directly changes sequence length and coverage.
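One simple way to quantify this is fertility: the average number of tokens produced per whitespace-delimited word. A sketch, where the character-splitting tokenizer is only a stand-in for whatever tokenizer you want to evaluate:

```python
def fertility(tokenize, texts):
    # Average number of tokens produced per whitespace-delimited word.
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words

# Stand-in tokenizer: splits every word into single characters.
char_tokenize = lambda text: [c for w in text.split() for c in w]
print(fertility(char_tokenize, ["lower new", "newer low"]))  # -> 4.0
```

Lower fertility means shorter sequences; a ratio near 1.0 on your domain usually indicates good coverage.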

Mini BPE-style intuition

pieces = ["l", "o", "w", "e", "r"]
merge_rules = [("l", "o"), ("lo", "w"), ("e", "r")]

for left, right in merge_rules:
    merged = left + right
    print(f"merge {left} + {right} -> {merged}")

The real algorithms operate on corpus statistics, not on a single word in isolation, but this captures the basic idea: frequent local patterns become reusable vocabulary items.
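A corpus-level version of the same idea counts pair frequencies across word counts and merges the winner. The toy corpus below is invented for illustration and this is a sketch of the training loop, not a production recipe:

```python
from collections import Counter

# Toy corpus: each word as a tuple of symbols, mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("l", "o", "n", "g"): 1, ("n", "e", "w"): 3}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(corpus, pair):
    # Rewrite every word, replacing the chosen pair with its merged symbol.
    merged = pair[0] + pair[1]
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if word[i:i + 2] == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

for _ in range(2):  # learn two merge rules
    pair = most_frequent_pair(corpus)
    corpus = apply_merge(corpus, pair)
    print("merged", pair)
```

Each learned merge becomes a reusable vocabulary item; real trainers repeat this loop until the target vocabulary size is reached.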

What engineers should measure

  • Token counts and tokens-per-word on your real inputs: prose, code, tables, and multilingual text.
  • Truncation risk: how often inputs exceed the context budget after tokenization.
  • Coverage: how often names, URLs, and product IDs survive as pieces meaningful enough to be useful.

Common failure modes

  • Token-count inflation on multilingual, noisy, or identifier-heavy input.
  • Rare strings shattered into many tiny pieces that carry little meaning individually.
  • Whitespace and punctuation preserved or encoded inconsistently across tokenizers and domains.

Keep this module separate from later topics. Tokenization explains how strings become discrete pieces. It is not the same thing as text representation, embeddings, sequence modeling, or transformer internals.

Quick decision framing

If you need robust open-vocabulary behavior, subword tokenization is usually the default. If you need exact, reversible handling of arbitrary text, byte-level or raw-text methods become attractive. If your domain has heavy morphology or no reliable whitespace boundaries, SentencePiece-style training and careful vocabulary-size selection matter more than they do in clean English prose.
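To make the byte-level option concrete: any string reduces to UTF-8 bytes, so a base alphabet of 256 values covers arbitrary input with no unknown token. Plain Python, no tokenizer library needed:

```python
text = "héllo 👋"
byte_ids = list(text.encode("utf-8"))  # every ID is in the range 0..255

print(byte_ids)
print(bytes(byte_ids).decode("utf-8") == text)  # True: fully reversible
```

Note the tradeoff from the table above: this 7-character string becomes 11 byte IDs, so byte-level coverage comes at the cost of longer sequences on non-ASCII text.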

To-do list

Learn

  • Understand the difference between word-level, character-level, and subword tokenization.
  • Learn the intuition behind BPE, WordPiece, unigram, and SentencePiece-style training.
  • Study how vocabulary size trades off against sequence length and open-vocabulary coverage.
  • Understand why tokenization affects cost, truncation risk, and robustness in production text.
  • Learn the main tokenization failure modes for multilingual, noisy, and identifier-heavy input.

Practice

  • Tokenize the same sentences with at least two tokenizers and compare the resulting pieces.
  • Inspect how names, URLs, product IDs, and misspellings get segmented.
  • Measure token counts for prose, code, tables, and multilingual text to see where inflation appears.
  • Write down examples where the tokenizer split seems intuitive versus obviously awkward.
  • Test whether whitespace and punctuation are preserved or encoded in special ways.

Build

  • Create a small comparison tool that shows token pieces, IDs, and counts for multiple tokenizers.
  • Build a token-budget estimator for your own prompts, datasets, or API usage patterns.
  • Document domain-specific tokenization pain points such as legal citations, code, or medical terms.
  • Train a tiny tokenizer on a sample corpus and inspect how vocabulary size changes segmentation.
  • Produce a short evaluation note recommending one tokenizer strategy for your target domain.