Beginner
Why tokenization exists
Models do not consume raw strings directly. They consume token IDs. Tokenization is the step that decides how text such as unbelievable, SKU-8472A, or こんにちは gets broken into units that can be mapped to a fixed vocabulary. That makes tokenization one of the earliest and most important design choices in any NLP pipeline.
- Word-level tokenization: easy to explain, but every new spelling or rare word creates an out-of-vocabulary problem.
- Character-level tokenization: covers everything, but sequences become long and each unit carries little semantic information.
- Subword tokenization: splits text into reusable parts so common words can stay compact while rare words are still representable.
Raw text -> tokenizer -> token pieces -> token IDs -> model

"unbelievable" might become: ["un", "believ", "able"]
"SKU-8472A" might become: ["SKU", "-", "847", "2", "A"]
What problem subwords solve
If every distinct word needed its own entry, vocabularies would explode in size because of inflections, derivations, typos, usernames, URLs, and domain-specific identifiers. Subwords let the system reuse pieces it already knows.
examples = {
    "unbelievable": ["un", "believ", "able"],
    "rerunning": ["re", "run", "ning"],
    "bioinformatics": ["bio", "inform", "atics"]
}

for word, pieces in examples.items():
    print(word, "->", pieces)
Real-world example: user-generated text often contains unseen forms such as product codes, slang, and spelling variants. Subword units help the model process these inputs instead of collapsing them into a generic unknown token.
Simple comparison
| Approach | Main advantage | Main drawback |
|---|---|---|
| Word | Short sequences and intuitive tokens | Huge vocabulary and poor handling of unseen words |
| Character | Full coverage of any input string | Very long sequences and weaker local meaning per token |
| Subword | Good balance between coverage and efficiency | Quality depends heavily on tokenizer design and training data |
Practical intuition: tokenization is not just preprocessing. It affects how much of the context window you consume, whether rare strings survive intact enough to be useful, and how stable your pipeline is across domains.
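The tradeoffs in the table above can be made concrete by splitting one sentence at all three granularities. The subword split below is hand-written for illustration, not the output of any real tokenizer:

```python
# Compare token counts for the same sentence at three granularities.
sentence = "unbelievable retokenization"

word_tokens = sentence.split()                 # word-level: split on whitespace
char_tokens = list(sentence.replace(" ", ""))  # character-level: one token per char
subword_tokens = ["un", "believ", "able", "re", "token", "ization"]  # illustrative

for name, tokens in [("word", word_tokens),
                     ("char", char_tokens),
                     ("subword", subword_tokens)]:
    print(f"{name:8s} {len(tokens):3d} tokens: {tokens}")
```

The word split is shortest but would fail on unseen words; the character split covers everything at the cost of long sequences; the subword split sits in between.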
Advanced
Core algorithm families
Popular methods include Byte Pair Encoding (BPE), WordPiece, unigram language-model tokenization, and SentencePiece-style training on raw text. They all try to learn a compact vocabulary that keeps common pieces short while still allowing rare forms to be decomposed productively.
- BPE: starts from small units and repeatedly merges the most frequent adjacent pairs.
- WordPiece: similar in spirit to BPE, but chooses merges using a likelihood-oriented score rather than only raw pair frequency.
- Unigram: begins with many candidate pieces, then removes the ones that contribute least to an efficient probabilistic segmentation.
- SentencePiece: trains subword models directly on raw text and treats whitespace as an explicit symbol, which is useful for languages without space-delimited words.
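At inference time, WordPiece-style tokenizers typically segment a word greedily, always taking the longest vocabulary piece that matches. A minimal sketch, using a tiny hand-picked vocabulary (real vocabularies are learned from a corpus) and the common "##" convention for continuation pieces:

```python
# Greedy longest-match-first segmentation, WordPiece-style.
# The vocabulary here is a toy assumption for illustration.
vocab = {"un", "##believ", "##able", "re", "##run", "##ning", "[UNK]"}

def wordpiece_segment(word, vocab, max_piece_len=20):
    pieces, start = [], 0
    while start < len(word):
        end = min(len(word), start + max_piece_len)
        match = None
        while end > start:                    # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                  # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_segment("unbelievable", vocab))  # ["un", "##believ", "##able"]
print(wordpiece_segment("rerunning", vocab))     # ["re", "##run", "##ning"]
```

Note how the fallback to a single unknown token for the whole word is exactly the failure mode that byte-level base alphabets avoid.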
How the training logic differs
| Method | Training intuition | Common strength | Common tradeoff |
|---|---|---|---|
| BPE | Merge the most frequent adjacent symbols again and again | Simple and widely used | Greedy merges may not be globally optimal |
| WordPiece | Prefer merges that are especially informative under the corpus statistics | Often produces useful compact pieces | Still depends strongly on training corpus composition |
| Unigram | Start large, then prune candidate pieces by likelihood impact | Flexible segmentation and sampling variants | More probabilistic and less intuitive to reason about |
| Byte-level BPE | Use bytes as the base alphabet before learning merges | Can represent arbitrary text without an unknown token | May inflate token counts for some inputs |
Corpus -> collect frequencies or likelihood statistics -> choose vocabulary size -> learn merge rules or keep/drop candidate pieces -> tokenize new text into pieces -> map pieces to IDs

Tokenizer quality directly changes sequence length and coverage.
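The byte-level row in the table deserves a concrete look. Because every Unicode string decomposes into UTF-8 bytes, a byte base alphabet can represent arbitrary input before any merges are learned, but non-Latin scripts start with more base units per character:

```python
# Bytes as a base alphabet: any string decomposes into UTF-8 bytes, so there
# is never an unknown symbol, but scripts outside ASCII begin with several
# byte units per character before merges compress them.
for text in ["hello", "こんにちは"]:
    byte_units = list(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} chars -> {len(byte_units)} byte units")
```

This is the mechanism behind the "may inflate token counts for some inputs" tradeoff: a Japanese greeting starts from three times as many base units as an ASCII word of the same character length.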
Mini BPE-style intuition
pieces = ["l", "o", "w", "e", "r"]
merge_rules = [("l", "o"), ("lo", "w"), ("e", "r")]

for left, right in merge_rules:
    merged = left + right
    print(f"merge {left} + {right} -> {merged}")
The real algorithms operate on corpus statistics, not on a single word in isolation, but this captures the basic idea: frequent local patterns become reusable vocabulary items.
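To go one step beyond hand-written merge rules, here is a minimal training sketch that derives merges from pair frequencies over a toy word-frequency corpus. It simplifies real BPE in several ways (no end-of-word marker, word-internal merges only, ties broken by insertion order):

```python
from collections import Counter

# Minimal BPE training sketch on a toy corpus of word frequencies.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {w: list(w) for w in corpus}          # each word starts as characters

def most_frequent_pair(words, corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for w, symbols in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += corpus[w]
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

merges = []
for _ in range(5):                             # learn five merge rules
    pair = most_frequent_pair(words, corpus)
    if pair is None:
        break
    merges.append(pair)
    words = {w: apply_merge(s, pair) for w, s in words.items()}

print("learned merges:", merges)
print("segmentations:", words)
```

Frequent patterns such as "es" and "est" get merged first because they appear in the highest-frequency words, and "low" collapses back into a single piece: the corpus statistics, not any linguistic rule, decide which substrings become vocabulary items.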
What engineers should measure
- Token inflation: some domains, especially code, log lines, IDs, and multilingual text, produce far more tokens than you expect from character count alone.
- Coverage of rare forms: wherever possible, a tokenizer should split unknown strings into meaningful pieces rather than useless fragments.
- Morphology handling: languages with rich inflection or compounding need tokenizers that do not shatter words into excessively long sequences.
- Reproducibility: model weights and tokenizer vocabulary are coupled. Swapping one without the other breaks the system.
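Token inflation is easy to measure. The sketch below uses a toy regex tokenizer (split into letter runs, single digits, and individual symbols, loosely mimicking how many real tokenizers fragment digits and punctuation) as a stand-in for a production tokenizer; the absolute numbers are not meaningful, but the relative inflation across domains shows the pattern:

```python
import re

# Toy stand-in for a real tokenizer: letter runs stay whole, digits and
# symbols become single tokens. Real subword tokenizers differ, but
# identifier-heavy text inflates under both.
def toy_tokenize(text):
    return re.findall(r"[A-Za-z]+|\d|[^A-Za-z\d\s]", text)

samples = {
    "prose": "The quick brown fox jumps over the lazy dog.",
    "identifiers": "SKU-8472A ORDER-2291-XL user_4471@example.com",
    "code": "result = df[df['col'] > 0].groupby('key').sum()",
}
for name, text in samples.items():
    tokens = toy_tokenize(text)
    print(f"{name:12s} {len(tokens):3d} tokens / {len(text):3d} chars "
          f"= {len(tokens) / len(text):.2f} tokens per char")
```

Running the same comparison with your actual tokenizer, on your actual traffic, is the cheapest way to catch inflation before it shows up as cost or truncation in production.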
Common failure modes
- Over-fragmentation: meaningful units get broken into too many pieces, increasing sequence length and weakening signal.
- Domain mismatch: a tokenizer trained on generic web text may split biomedical, legal, or programming terms poorly.
- Cross-language imbalance: some languages consume many more tokens for the same amount of content, which creates cost and fairness issues in multilingual systems.
- Whitespace and formatting surprises: tokenizers that encode spaces, punctuation, or bytes in special ways can make debugging harder if you never inspect the actual pieces.
Keep this module separate from later topics. Tokenization explains how strings become discrete pieces. It is not the same thing as text representation, embeddings, sequence modeling, or transformer internals.
Quick decision framing
If you need robust open-vocabulary behavior, subword tokenization is usually the default. If you need exact, reversible handling of arbitrary text, byte-level or raw-text methods become attractive. If your domain has heavy morphology or no reliable whitespace boundaries, SentencePiece-style training and careful vocabulary-size selection matter more than they do in clean English prose.
To-do list
Learn
- Understand the difference between word-level, character-level, and subword tokenization.
- Learn the intuition behind BPE, WordPiece, unigram, and SentencePiece-style training.
- Study how vocabulary size trades off against sequence length and open-vocabulary coverage.
- Understand why tokenization affects cost, truncation risk, and robustness in production text.
- Learn the main tokenization failure modes for multilingual, noisy, and identifier-heavy input.
Practice
- Tokenize the same sentences with at least two tokenizers and compare the resulting pieces.
- Inspect how names, URLs, product IDs, and misspellings get segmented.
- Measure token counts for prose, code, tables, and multilingual text to see where inflation appears.
- Write down examples where the tokenizer split seems intuitive versus obviously awkward.
- Test whether whitespace and punctuation are preserved or encoded in special ways.
Build
- Create a small comparison tool that shows token pieces, IDs, and counts for multiple tokenizers.
- Build a token-budget estimator for your own prompts, datasets, or API usage patterns.
- Document domain-specific tokenization pain points such as legal citations, code, or medical terms.
- Train a tiny tokenizer on a sample corpus and inspect how vocabulary size changes segmentation.
- Produce a short evaluation note recommending one tokenizer strategy for your target domain.