Beginner
BERT is trained by hiding some tokens and asking the model to recover them from both left and right context. That bidirectional setup helped transformers learn much richer representations for sentence classification, token labeling, extractive question answering, and semantic matching than earlier feature-based pipelines.
Why BERT mattered
- Pretraining first, supervision second: the model learns generic syntax and semantics from unlabeled text, then adapts with relatively little labeled data.
- Bidirectional context: each masked token can use both the left and right side of the sentence, which is especially useful for understanding tasks.
- One backbone, many tasks: the same pretrained encoder can be fine-tuned for sentiment analysis, natural language inference, named entity recognition, and extractive QA.
- Encoder focus: BERT is built to understand text representations, not to generate long free-form answers token by token.
What the model actually sees
BERT input is more structured than plain text. A sequence is converted into subword tokens, wrapped with special markers, and represented by the sum of token, segment, and position embeddings.
Sentence A                       Sentence B
     |                                |
WordPiece tokens                 WordPiece tokens
     |                                |
[CLS] ... tokens ... [SEP] ... tokens ... [SEP]
  + token embeddings + segment embeddings + position embeddings
- [CLS] is a special classification token whose final representation is often used for sentence-level prediction.
- [SEP] separates segments such as sentence pairs.
- Segment embeddings tell the model whether a token belongs to sentence A or sentence B.
- WordPiece tokenization breaks rare words into reusable subword units so vocabulary size stays manageable.
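A quick way to see this packaging in practice, assuming the `transformers` library is installed, is to encode a sentence pair and inspect the result. The pair of sentences here is just an illustrative example:

```python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer that ships with the original BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair; the tokenizer inserts [CLS] and [SEP] automatically
# and emits token_type_ids (segment IDs: 0 for sentence A, 1 for sentence B).
enc = tokenizer("The service was great.", "Would you recommend it?")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])
```

The printed token list starts with [CLS], contains two [SEP] markers, and the segment IDs flip from 0 to 1 exactly where sentence B begins.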
Core pretraining idea
The central BERT objective is masked language modeling. Some tokens are selected for prediction, and the model learns to infer them from context rather than from hand-built linguistic rules.
# Toy illustration of a single MLM training example.
sentence = "The capital of France is Paris."
tokens = ["[CLS]", "the", "capital", "of", "france", "is", "[MASK]", ".", "[SEP]"]
target = "paris"  # the model must recover this from both sides of the mask
print(tokens)
print("predict:", target)
Real-world example: if you fine-tune BERT for customer-support intent classification, you usually need far less labeled data than if you trained an encoder from scratch because pretraining already gave the model a useful language representation space.
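You can try the masked-token objective interactively with the `fill-mask` pipeline (this assumes `transformers` plus a backend such as PyTorch is installed; the checkpoint is the standard `bert-base-uncased`):

```python
from transformers import pipeline

# fill-mask runs the pretrained MLM head over a sentence with one [MASK] token
# and returns the top candidate tokens with their scores.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

preds = unmasker("The capital of France is [MASK].")
for p in preds:
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```

The top suggestions for this sentence should include "paris", showing the model genuinely uses the surrounding context rather than any hand-built rule.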
Pretraining and fine-tuning workflow
Large unlabeled text
|
v
Masked-language-model pretraining
|
v
Pretrained BERT checkpoint
|
+--> Add classifier head for sentiment / NLI
+--> Add token head for tagging / NER
+--> Add span head for extractive QA
Simple mental model: pretraining builds general-purpose language features; fine-tuning reshapes those features for one concrete downstream objective.
Advanced
Pretraining objectives and training recipes determine what an encoder learns. BERT combined masked language modeling with next sentence prediction, while later work such as RoBERTa showed that longer training, more data, bigger batches, and a revised masking strategy could outperform many changes that had been presented as architectural improvements.
BERT pretraining objectives
- Masked language modeling (MLM): select a subset of tokens and predict the original identity of those tokens from context.
- Original masking recipe: BERT selects 15% of tokens for prediction; of those selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.
- Next sentence prediction (NSP): a sentence-pair objective intended to help the model reason about inter-sentence relationships.
- Transfer effect: once pretrained, the encoder can usually be adapted by adding only a small task head rather than designing a custom network per task.
Why the 80/10/10 trick exists: always replacing with [MASK] would make pretraining too different from downstream inference, where [MASK] never appears.
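The selection-and-replacement recipe can be sketched in a few lines of plain Python. This is an illustrative toy version, not the exact BERT implementation, and the small vocabulary list is invented for the example:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Toy sketch of BERT's masking recipe: pick ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    vocab = ["the", "a", "paris", "city", "river", "capital"]  # toy vocabulary
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        targets[i] = tok                 # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            out[i] = "[MASK]"            # 80%: replace with the mask token
        elif r < 0.9:
            out[i] = rng.choice(vocab)   # 10%: replace with a random token
        # else: 10% keep the original token, but it is still a prediction target
    return out, targets

tokens = ["[CLS]", "the", "capital", "of", "france", "is", "paris", ".", "[SEP]"]
masked, targets = mask_tokens(tokens, seed=0)
print(masked)
print(targets)
```

Note that the loss is computed only at the selected positions; the other tokens are input context, not prediction targets.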
Recipe details that changed after BERT
| Model | What changed | Why it mattered |
|---|---|---|
| BERT | MLM + NSP with static masking and standard pretraining schedule. | Established the encoder pretrain-then-fine-tune paradigm. |
| RoBERTa | Removed NSP, trained longer, used larger batches, more data, and dynamic masking. | Showed BERT had been undertrained and that recipe quality mattered a lot. |
| ALBERT | Parameter sharing and factorized embeddings; replaced NSP with sentence order prediction. | Reduced parameter count while preserving strong transfer performance. |
| DistilBERT | Compressed a BERT-style model with distillation after pretraining. | Made encoder transfer cheaper at inference time with moderate quality loss. |
Static versus dynamic masking
In static masking, each training example tends to reuse the same masked positions across epochs. In dynamic masking, the masked tokens can change each time the example is seen. Dynamic masking exposes the model to more prediction targets and generally uses the corpus more efficiently.
Same sentence across epochs:

Static masking:
  Epoch 1: The capital of France is [MASK].
  Epoch 2: The capital of France is [MASK].

Dynamic masking:
  Epoch 1: The capital of [MASK] is Paris.
  Epoch 2: The [MASK] of France is Paris.
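The difference is only about when the masked positions are sampled. A minimal sketch, using a toy sampler rather than the real training code:

```python
import random

def sample_mask(tokens, rng, mask_prob=0.3):
    """Pick masked positions for one pass over an example (toy version)."""
    return [i for i, t in enumerate(tokens)
            if t not in ("[CLS]", "[SEP]") and rng.random() < mask_prob]

tokens = ["[CLS]", "the", "capital", "of", "france", "is", "paris", ".", "[SEP]"]

# Static masking: positions are drawn once and reused for every epoch.
static_positions = sample_mask(tokens, random.Random(0))
for epoch in range(2):
    print("static  epoch", epoch, static_positions)

# Dynamic masking: positions are re-drawn each time the example is seen.
rng = random.Random(0)
for epoch in range(2):
    print("dynamic epoch", epoch, sample_mask(tokens, rng))
```

In the static case the model sees the same prediction targets every epoch; in the dynamic case each pass can turn different tokens into targets, which is how RoBERTa extracts more signal from the same corpus.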
What gets transferred downstream
BERT-style pretraining gives you contextual token representations, sentence-level pooled representations, and layer-wise features that can be reused in different ways depending on the task.
- Sequence classification: use the final [CLS] representation or a pooled variant for labels such as sentiment or topic.
- Token classification: read the per-token hidden states for tasks like named entity recognition or part-of-speech tagging.
- Sentence-pair reasoning: package two texts with separator tokens for entailment, duplicate detection, or ranking.
- Extractive QA: predict answer start and end positions over the input tokens rather than generating an answer freely.
Tokenized corpus -> MLM / sentence-level objective -> pretrained encoder
|
+--> frozen features
+--> full fine-tuning
+--> task-specific head
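The transferred representations above are directly inspectable. Loading the bare encoder (no task head) shows the per-token hidden states and the [CLS] position used for sequence-level features; the input sentence is just an example:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the plain encoder, without any task head, to inspect what transfers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["the service was excellent"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# Per-token contextual states: one hidden vector per input token.
print(outputs.last_hidden_state.shape)   # (batch, seq_len, hidden)

# The [CLS] position (index 0) is the usual sequence-level feature.
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)
```

Token classification heads read the full `last_hidden_state`; sequence classification heads read the [CLS] vector (or a pooled variant); extractive QA heads project every position to start and end scores.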
Minimal fine-tuning pattern
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from the pretrained encoder and attach a fresh 2-way classifier head.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize a small batch; padding and truncation keep the tensors rectangular.
batch = tokenizer(
    ["the service was excellent", "the refund process was confusing"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
Limits and common misunderstandings
- BERT is not a chat model: encoder pretraining gives strong representations, but it does not make the model naturally autoregressive or instruction-following.
- Masking creates a train-test mismatch: downstream inference rarely contains explicit mask tokens.
- Corpus quality matters: more text helps only when the text is diverse, clean enough, and reasonably matched to the target domain.
- Sequence length constraints matter: classic BERT uses a fixed context window and may struggle on long documents without careful chunking or specialized variants.
- Recipe improvements are not always architectural breakthroughs: RoBERTa is the standard cautionary example that a longer, better-tuned training run can beat ideas presented as newer architectures.
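The long-document point above is usually handled with overlapping windows. A toy sketch (real pipelines typically use the tokenizer's built-in stride/overflow options instead; the window size 512 matches classic BERT's position limit):

```python
def chunk_with_overlap(token_ids, window=512, stride=128):
    """Split a long token sequence into overlapping windows so that no span
    loses all of its surrounding context at a chunk boundary (illustrative)."""
    chunks = []
    step = window - stride
    for start in range(0, max(len(token_ids) - stride, 1), step):
        chunks.append(token_ids[start:start + window])
    return chunks

ids = list(range(1000))  # stand-in for a long tokenized document
chunks = chunk_with_overlap(ids, window=512, stride=128)
print([len(c) for c in chunks])  # every id appears in at least one chunk
```

Each adjacent pair of chunks shares `stride` tokens, so answers or entities that straddle a boundary still appear whole in at least one window.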
Keep this module separate from general transformer mechanics and decoder-only LLM behavior. The core topic here is encoder pretraining, objective design, and transfer learning around the BERT family.
Understanding BERT matters because it explains the modern transfer-learning playbook: start with a broad self-supervised objective, learn reusable representations at scale, and adapt them efficiently to downstream tasks.
To-do list
Learn
- Understand masked language modeling, the 15% masking rule, and why the 80/10/10 replacement scheme is used.
- Learn how BERT packages sentence pairs with [CLS], [SEP], segment embeddings, and positional information.
- Study the difference between pretraining, task heads, frozen-feature use, and full fine-tuning.
- Understand why RoBERTa showed that recipe choices such as data size, masking, and training duration matter so much.
Practice
- Tokenize several sentence pairs and inspect where [CLS], [SEP], and segment IDs appear.
- Run masked-token prediction examples and check whether the model uses both left and right context sensibly.
- Fine-tune a small BERT checkpoint on one classification task and compare it with a linear baseline.
- Compare frozen-encoder features against end-to-end fine-tuning on the same dataset.
Build
- Create a sentence-classification app backed by a pretrained BERT-family encoder.
- Build an experiment sheet comparing BERT, RoBERTa, and a non-pretrained baseline on the same task.
- Write a short technical note explaining why encoder pretraining changed NLP transfer learning.
- Package a reproducible fine-tuning workflow with saved tokenizer, checkpoint, metrics, and inference script.