Subject 22

BERT and transformer pretraining

Pretraining teaches a transformer broad language patterns on large unlabeled corpora before task-specific adaptation. BERT made bidirectional encoder pretraining practical and showed that one pretrained model could transfer well across many language understanding tasks.

Beginner

Why BERT mattered

BERT is trained by hiding some tokens and asking the model to recover them from both left and right context. That bidirectional setup helped transformers learn much richer representations for sentence classification, token labeling, extractive question answering, and semantic matching than earlier feature-based pipelines.

What the model actually sees

BERT input is more structured than plain text. A sequence is converted into subword tokens, wrapped with special markers, and represented by the sum of token, segment, and position embeddings.

Sentence A            Sentence B
    |                     |
WordPiece tokens     WordPiece tokens
    |                     |
[CLS] ... tokens ... [SEP] ... tokens ... [SEP]
   + token embeddings + segment embeddings + position embeddings
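This packaging step can be sketched in plain Python. The sketch uses toy word-level tokens as a stand-in for WordPiece subwords, and `pack_pair` is an illustrative helper, not a real tokenizer API:

```python
# Toy sketch of BERT input packaging (illustrative; real BERT uses WordPiece subwords).
def pack_pair(tokens_a, tokens_b):
    # Wrap the pair with special markers: [CLS] up front, [SEP] after each segment.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment (token type) IDs: 0 for sentence A and its markers, 1 for sentence B's part.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Position IDs: simply the index of each token in the packed sequence.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segments, positions = pack_pair(
    ["the", "service", "was", "great"], ["i", "agree"]
)
print(tokens)
print(segments)
```

The model then looks up an embedding for each of the three ID streams and sums them per position.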

Core pretraining idea

The central BERT objective is masked language modeling. Some tokens are selected for prediction, and the model learns to infer them from context rather than from hand-built linguistic rules.

sentence = "The capital of France is Paris."
tokens = ["[CLS]", "the", "capital", "of", "france", "is", "[MASK]", ".", "[SEP]"]
target = "paris"

print(tokens)
print("predict:", target)

Real-world example: if you fine-tune BERT for customer-support intent classification, you usually need far less labeled data than if you trained an encoder from scratch because pretraining already gave the model a useful language representation space.

Pretraining and fine-tuning workflow

Large unlabeled text
        |
        v
Masked-language-model pretraining
        |
        v
Pretrained BERT checkpoint
        |
        +--> Add classifier head for sentiment / NLI
        +--> Add token head for tagging / NER
        +--> Add span head for extractive QA

Simple mental model: pretraining builds general-purpose language features; fine-tuning reshapes those features for one concrete downstream objective.
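The three heads in the workflow above mostly differ in which encoder outputs they read. A shape-only sketch in plain Python with made-up numbers (no real model; `linear` is a hand-rolled dense layer, and all dimensions are arbitrary):

```python
import random

random.seed(0)
seq_len, hidden, num_labels = 6, 8, 2

def linear(vec, weights, bias):
    """One dense layer: logits[j] = sum_i vec[i] * weights[i][j] + bias[j]."""
    return [sum(v * w[j] for v, w in zip(vec, weights)) + bias[j]
            for j in range(len(bias))]

# Pretend encoder output: one hidden vector per token position.
encoder_out = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(seq_len)]
W = [[random.gauss(0, 0.1) for _ in range(num_labels)] for _ in range(hidden)]
b = [0.0] * num_labels

# Sentence classifier head: reads only the [CLS] position (index 0).
cls_logits = linear(encoder_out[0], W, b)

# Token-labeling head: applies the same kind of layer at every position.
tag_logits = [linear(vec, W, b) for vec in encoder_out]

print(len(cls_logits), len(tag_logits), len(tag_logits[0]))
```

A span head for extractive QA follows the token-head pattern but emits two logits per position (start and end scores).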

Advanced

Pretraining objectives and training recipes determine what an encoder learns. BERT combined masked language modeling with next sentence prediction, while later work such as RoBERTa showed that longer training, more data, bigger batches, and a revised masking strategy could outperform many improvements that had been attributed to architecture.

BERT pretraining objectives

In masked language modeling, roughly 15% of input tokens are selected for prediction. Of those selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. Why the 80/10/10 trick exists: always replacing with [MASK] would make pretraining too different from downstream inference, where [MASK] never appears, so the model must also learn useful representations for real, unmasked tokens.
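A minimal sketch of the selection-and-replacement scheme, assuming the rates from the original recipe (`mask_for_mlm` is an illustrative name, not a library function, and the toy vocabulary is made up):

```python
import random

def mask_for_mlm(tokens, vocab, select_rate=0.15, rng=None):
    """BERT-style masking: select ~select_rate of positions; of those,
    80% become [MASK], 10% a random vocab token, 10% stay unchanged."""
    rng = rng or random.Random(0)
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_rate:
            continue                       # position not selected for prediction
        targets[i] = tok                   # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # random-token replacement
        # else: token deliberately left unchanged
    return inputs, targets

vocab = ["the", "capital", "of", "france", "is", "paris", "."]
inputs, targets = mask_for_mlm(vocab, vocab, select_rate=0.5)
print(inputs)
print(targets)
```

The demo uses a higher selection rate (0.5) only so that a short sentence is likely to show some masked positions.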

Recipe details that changed after BERT

Model      | What changed                                                                               | Why it mattered
BERT       | MLM + NSP with static masking and a standard pretraining schedule.                         | Established the encoder pretrain-then-fine-tune paradigm.
RoBERTa    | Removed NSP; trained longer with larger batches, more data, and dynamic masking.           | Showed BERT had been undertrained and that recipe quality mattered a lot.
ALBERT     | Parameter sharing and factorized embeddings; NSP replaced with sentence order prediction.  | Reduced parameter count while preserving strong transfer performance.
DistilBERT | Compressed a BERT-style model with distillation after pretraining.                         | Made encoder transfer cheaper at inference time with moderate quality loss.

Static versus dynamic masking

In static masking, each training example tends to reuse the same masked positions across epochs. In dynamic masking, the masked tokens can change each time the example is seen. Dynamic masking exposes the model to more prediction targets and generally uses the corpus more efficiently.

Same sentence across epochs

Static masking:
Epoch 1: The capital of France is [MASK].
Epoch 2: The capital of France is [MASK].

Dynamic masking:
Epoch 1: The capital of [MASK] is Paris.
Epoch 2: The [MASK] of France is Paris.
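The contrast can be sketched by re-drawing mask positions each epoch. Function names here are illustrative, and a single masked position stands in for the full 15% scheme; real implementations typically mask on the fly inside the data loader:

```python
import random

sentence = ["the", "capital", "of", "france", "is", "paris"]

def static_masks(tokens, preprocess_seed=0):
    """Static masking: positions chosen once at preprocessing time and reused."""
    rng = random.Random(preprocess_seed)
    return [rng.randrange(len(tokens))]

def dynamic_masks(tokens, epoch):
    """Dynamic masking: positions re-sampled each time the example is seen.
    Seeding by epoch keeps this sketch reproducible."""
    rng = random.Random(epoch)
    return [rng.randrange(len(tokens))]

fixed = static_masks(sentence)
for epoch in range(3):
    print("epoch", epoch, "static:", fixed, "dynamic:", dynamic_masks(sentence, epoch))
```

With static masking the same positions repeat every epoch; with dynamic masking the model sees a fresh prediction target each pass, which is why RoBERTa's recipe uses the corpus more efficiently.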

What gets transferred downstream

BERT-style pretraining gives you contextual token representations, sentence-level pooled representations, and layer-wise features that can be reused in different ways depending on the task.

Tokenized corpus -> MLM / sentence-level objective -> pretrained encoder
       |
       +--> frozen features
       +--> full fine-tuning
       +--> task-specific head

Minimal fine-tuning pattern

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pretrained encoder plus a freshly initialized two-class classification head.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize a small batch; padding and truncation produce equal-length tensors.
batch = tokenizer(
    ["the service was excellent", "the refund process was confusing"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

# Forward pass yields one logit per class per example: shape (2, 2) here.
outputs = model(**batch)
print(outputs.logits.shape)
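Logits like the ones printed above are unnormalized scores; turning them into class probabilities is a plain softmax, sketched here with the standard library on made-up logit values:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical two-class logits for a single example.
probs = softmax([2.0, -1.0])
print(probs)  # probabilities sum to 1; the first class dominates
```

The predicted label is then just the argmax over these probabilities (equivalently, over the raw logits).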

Limits and common misunderstandings

A common confusion is to fold this topic into general transformer mechanics or decoder-only LLM behavior; keep them separate. BERT-family encoders do not generate text autoregressively. The core topic here is encoder pretraining, objective design, and transfer learning around the BERT family.

Understanding BERT matters because it explains the modern transfer-learning playbook: start with a broad self-supervised objective, learn reusable representations at scale, and adapt them efficiently to downstream tasks.

To-do list

Learn

  • Understand masked language modeling, the 15% masking rule, and why the 80/10/10 replacement scheme is used.
  • Learn how BERT packages sentence pairs with [CLS], [SEP], segment embeddings, and positional information.
  • Study the difference between pretraining, task heads, frozen-feature use, and full fine-tuning.
  • Understand why RoBERTa showed that recipe choices such as data size, masking, and training duration matter so much.

Practice

  • Tokenize several sentence pairs and inspect where [CLS], [SEP], and segment IDs appear.
  • Run masked-token prediction examples and check whether the model uses both left and right context sensibly.
  • Fine-tune a small BERT checkpoint on one classification task and compare it with a linear baseline.
  • Compare frozen-encoder features against end-to-end fine-tuning on the same dataset.

Build

  • Create a sentence-classification app backed by a pretrained BERT-family encoder.
  • Build an experiment sheet comparing BERT, RoBERTa, and a non-pretrained baseline on the same task.
  • Write a short technical note explaining why encoder pretraining changed NLP transfer learning.
  • Package a reproducible fine-tuning workflow with saved tokenizer, checkpoint, metrics, and inference script.