Beginner
Why tokenization exists
Models do not consume raw strings directly. They consume token IDs. Tokenization is the step that decides how text such as unbelievable, SKU-8472A, or こんにちは gets broken into units that can be mapped to a fixed vocabulary. That makes tokenization one of the earliest and most important design choices in any NLP pipeline.
- Word-level tokenization: easy to explain, but every new spelling or rare word creates an out-of-vocabulary problem.
- Character-level tokenization: covers everything, but sequences become long and each unit carries little semantic information.
- Subword tokenization: splits text into reusable parts so common words can stay compact while rare words are still representable.
Raw text -> tokenizer -> token pieces -> token IDs -> model

"unbelievable" might become: ["un", "believ", "able"]
"SKU-8472A" might become: ["SKU", "-", "847", "2", "A"]
What problem subwords solve
If every distinct word needed its own entry, vocabularies would explode in size because of inflections, derivations, typos, usernames, URLs, and domain-specific identifiers. Subwords let the system reuse pieces it already knows.
examples = {
    "unbelievable": ["un", "believ", "able"],
    "rerunning": ["re", "run", "ning"],
    "bioinformatics": ["bio", "inform", "atics"]
}

for word, pieces in examples.items():
    print(word, "->", pieces)
Real-world example: user-generated text often contains unseen forms such as product codes, slang, and spelling variants. Subword units help the model process these inputs instead of collapsing them into a generic unknown token.
Simple comparison
| Approach | Main advantage | Main drawback |
|---|---|---|
| Word | Short sequences and intuitive tokens | Huge vocabulary and poor handling of unseen words |
| Character | Full coverage of any input string | Very long sequences and weaker local meaning per token |
| Subword | Good balance between coverage and efficiency | Quality depends heavily on tokenizer design and training data |
Practical intuition: tokenization is not just preprocessing. It affects how much of the context window you consume, whether rare strings survive intact enough to be useful, and how stable your pipeline is across domains.
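The tradeoffs in the table above can be made concrete by splitting one sentence at all three granularities. The subword split below is hand-written for illustration, not the output of any real tokenizer:

```python
# Compare token counts for the same sentence at three granularities.
sentence = "unbelievable retokenization"

word_tokens = sentence.split()                 # word-level: split on whitespace
char_tokens = list(sentence.replace(" ", ""))  # character-level: one token per char
subword_tokens = ["un", "believ", "able", "re", "token", "ization"]  # illustrative

for name, tokens in [("word", word_tokens),
                     ("char", char_tokens),
                     ("subword", subword_tokens)]:
    print(f"{name:8s} {len(tokens):3d} tokens: {tokens}")
```

The word split is shortest but would fail on unseen words; the character split covers everything at the cost of long sequences; the subword split sits in between.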
Advanced
Core algorithm families
Popular methods include Byte Pair Encoding (BPE), WordPiece, unigram language-model tokenization, and SentencePiece-style training on raw text. They all try to learn a compact vocabulary that keeps common pieces short while still allowing rare forms to be decomposed productively.
- BPE: starts from small units and repeatedly merges the most frequent adjacent pairs.
- WordPiece: similar in spirit to BPE, but chooses merges using a likelihood-oriented score rather than only raw pair frequency.
- Unigram: begins with many candidate pieces, then removes the ones that contribute least to an efficient probabilistic segmentation.
- SentencePiece: trains subword models directly on raw text and treats whitespace as an explicit symbol, which is useful for languages without space-delimited words.
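At inference time, WordPiece-style tokenizers typically segment a word greedily, always taking the longest vocabulary piece that matches. A minimal sketch, using a tiny hand-picked vocabulary (real vocabularies are learned from a corpus) and the common "##" convention for continuation pieces:

```python
# Greedy longest-match-first segmentation, WordPiece-style.
# The vocabulary here is a toy assumption for illustration.
vocab = {"un", "##believ", "##able", "re", "##run", "##ning", "[UNK]"}

def wordpiece_segment(word, vocab, max_piece_len=20):
    pieces, start = [], 0
    while start < len(word):
        end = min(len(word), start + max_piece_len)
        match = None
        while end > start:                    # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                  # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_segment("unbelievable", vocab))  # ["un", "##believ", "##able"]
print(wordpiece_segment("rerunning", vocab))     # ["re", "##run", "##ning"]
```

Note how the fallback to a single unknown token for the whole word is exactly the failure mode that byte-level base alphabets avoid.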
How the training logic differs
| Method | Training intuition | Common strength | Common tradeoff |
|---|---|---|---|
| BPE | Merge the most frequent adjacent symbols again and again | Simple and widely used | Greedy merges may not be globally optimal |
| WordPiece | Prefer merges that are especially informative under the corpus statistics | Often produces useful compact pieces | Still depends strongly on training corpus composition |
| Unigram | Start large, then prune candidate pieces by likelihood impact | Flexible segmentation and sampling variants | More probabilistic and less intuitive to reason about |
| Byte-level BPE | Use bytes as the base alphabet before learning merges | Can represent arbitrary text without an unknown token | May inflate token counts for some inputs |
Corpus -> collect frequencies or likelihood statistics -> choose vocabulary size -> learn merge rules or keep/drop candidate pieces -> tokenize new text into pieces -> map pieces to IDs

Tokenizer quality directly changes sequence length and coverage.
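The byte-level row in the table deserves a concrete look. Because every Unicode string decomposes into UTF-8 bytes, a byte base alphabet can represent arbitrary input before any merges are learned, but non-Latin scripts start with more base units per character:

```python
# Bytes as a base alphabet: any string decomposes into UTF-8 bytes, so there
# is never an unknown symbol, but scripts outside ASCII begin with several
# byte units per character before merges compress them.
for text in ["hello", "こんにちは"]:
    byte_units = list(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} chars -> {len(byte_units)} byte units")
```

This is the mechanism behind the "may inflate token counts for some inputs" tradeoff: a Japanese greeting starts from three times as many base units as an ASCII word of the same character length.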
Mini BPE-style intuition
pieces = ["l", "o", "w", "e", "r"]
merge_rules = [("l", "o"), ("lo", "w"), ("e", "r")]

for left, right in merge_rules:
    merged = left + right
    print(f"merge {left} + {right} -> {merged}")
The real algorithms operate on corpus statistics, not on a single word in isolation, but this captures the basic idea: frequent local patterns become reusable vocabulary items.
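To go one step beyond hand-written merge rules, here is a minimal training sketch that derives merges from pair frequencies over a toy word-frequency corpus. It simplifies real BPE in several ways (no end-of-word marker, word-internal merges only, ties broken by insertion order):

```python
from collections import Counter

# Minimal BPE training sketch on a toy corpus of word frequencies.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {w: list(w) for w in corpus}          # each word starts as characters

def most_frequent_pair(words, corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for w, symbols in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += corpus[w]
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

merges = []
for _ in range(5):                             # learn five merge rules
    pair = most_frequent_pair(words, corpus)
    if pair is None:
        break
    merges.append(pair)
    words = {w: apply_merge(s, pair) for w, s in words.items()}

print("learned merges:", merges)
print("segmentations:", words)
```

Frequent patterns such as "es" and "est" get merged first because they appear in the highest-frequency words, and "low" collapses back into a single piece: the corpus statistics, not any linguistic rule, decide which substrings become vocabulary items.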
What engineers should measure
- Token inflation: some domains, especially code, log lines, IDs, and multilingual text, produce far more tokens than you expect from character count alone.
- Coverage of rare forms: wherever possible, a tokenizer should split unknown strings into meaningful pieces rather than useless fragments.
- Morphology handling: languages with rich inflection or compounding need tokenizers that do not shatter words into excessively long sequences.
- Reproducibility: model weights and tokenizer vocabulary are coupled. Swapping one without the other breaks the system.
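Token inflation is easy to measure. The sketch below uses a toy regex tokenizer (split into letter runs, single digits, and individual symbols, loosely mimicking how many real tokenizers fragment digits and punctuation) as a stand-in for a production tokenizer; the absolute numbers are not meaningful, but the relative inflation across domains shows the pattern:

```python
import re

# Toy stand-in for a real tokenizer: letter runs stay whole, digits and
# symbols become single tokens. Real subword tokenizers differ, but
# identifier-heavy text inflates under both.
def toy_tokenize(text):
    return re.findall(r"[A-Za-z]+|\d|[^A-Za-z\d\s]", text)

samples = {
    "prose": "The quick brown fox jumps over the lazy dog.",
    "identifiers": "SKU-8472A ORDER-2291-XL user_4471@example.com",
    "code": "result = df[df['col'] > 0].groupby('key').sum()",
}
for name, text in samples.items():
    tokens = toy_tokenize(text)
    print(f"{name:12s} {len(tokens):3d} tokens / {len(text):3d} chars "
          f"= {len(tokens) / len(text):.2f} tokens per char")
```

Running the same comparison with your actual tokenizer, on your actual traffic, is the cheapest way to catch inflation before it shows up as cost or truncation in production.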
Common failure modes
- Over-fragmentation: meaningful units get broken into too many pieces, increasing sequence length and weakening signal.
- Domain mismatch: a tokenizer trained on generic web text may split biomedical, legal, or programming terms poorly.
- Cross-language imbalance: some languages consume many more tokens for the same amount of content, which creates cost and fairness issues in multilingual systems.
- Whitespace and formatting surprises: tokenizers that encode spaces, punctuation, or bytes in special ways can make debugging harder if you never inspect the actual pieces.
Keep this module separate from later topics. Tokenization explains how strings become discrete pieces. It is not the same thing as text representation, embeddings, sequence modeling, or transformer internals.
Quick decision framing
If you need robust open-vocabulary behavior, subword tokenization is usually the default. If you need exact, reversible handling of arbitrary text, byte-level or raw-text methods become attractive. If your domain has heavy morphology or no reliable whitespace boundaries, SentencePiece-style training and careful vocabulary-size selection matter more than they do in clean English prose.
To-do list
Learn
- Understand the difference between word-level, character-level, and subword tokenization.
- Learn the intuition behind BPE, WordPiece, unigram, and SentencePiece-style training.
- Study how vocabulary size trades off against sequence length and open-vocabulary coverage.
- Understand why tokenization affects cost, truncation risk, and robustness in production text.
- Learn the main tokenization failure modes for multilingual, noisy, and identifier-heavy input.
Practice
- Tokenize the same sentences with at least two tokenizers and compare the resulting pieces.
- Inspect how names, URLs, product IDs, and misspellings get segmented.
- Measure token counts for prose, code, tables, and multilingual text to see where inflation appears.
- Write down examples where the tokenizer split seems intuitive versus obviously awkward.
- Test whether whitespace and punctuation are preserved or encoded in special ways.
Build
- Create a small comparison tool that shows token pieces, IDs, and counts for multiple tokenizers.
- Build a token-budget estimator for your own prompts, datasets, or API usage patterns.
- Document domain-specific tokenization pain points such as legal citations, code, or medical terms.
- Train a tiny tokenizer on a sample corpus and inspect how vocabulary size changes segmentation.
- Produce a short evaluation note recommending one tokenizer strategy for your target domain.