Subject 21

Transformers

Transformers are sequence models built from stacked attention and feed-forward blocks instead of recurrence. They became the default deep-learning architecture for language because they model token interactions directly, train efficiently in parallel, and support encoder-only, decoder-only, and encoder-decoder designs.

Beginner

A transformer reads a sequence as a set of token representations that repeatedly exchange information across layers. Instead of carrying one hidden state forward step by step like an RNN, each layer lets a token update itself using other tokens in the sequence plus its own current representation. That makes long-range interactions easier to model and lets training process many positions at once.

Why transformers replaced recurrence

Recurrent models process tokens one step at a time, so training cannot parallelize across positions and information from distant tokens tends to fade. Transformers let every position interact with every other position directly and process all positions at once.

High-level data flow

The model first turns tokens into vectors, injects position information, and then passes those vectors through a stack of transformer blocks. Each block has two broad jobs: mix information across tokens, then transform each token representation independently with a small feed-forward network.

Input text
   -> tokenization
   -> token embeddings + positional information
   -> transformer block 1
   -> transformer block 2
   -> ...
   -> final contextual representations
   -> task head or next-token head
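The pipeline above can be sketched numerically. This is a minimal illustration with made-up dimensions and random tables standing in for learned parameters; the point is only the shapes that flow between stages.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 100, 16, 6

# Hypothetical token ids for one short input sequence.
token_ids = np.array([5, 42, 7, 19, 3, 88])

# Embedding lookup: one d_model-dimensional vector per token.
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids]                 # shape (seq_len, d_model)

# Inject position information (here: a random learned-style table).
position_table = rng.normal(size=(seq_len, d_model))
x = x + position_table

print(x.shape)  # (6, 16): one vector per token, ready for block 1
```

Every transformer block that follows consumes and produces tensors of this same `(seq_len, d_model)` shape, which is what makes the blocks stackable.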

What a transformer block is doing

A block alternates between communication and computation. In the communication step, token representations borrow information from other positions. In the computation step, each position is transformed independently into a richer local representation. Stacking many blocks lets the model build higher-level abstractions layer by layer.

components = [
    "token embeddings",
    "positional information",
    "attention sublayer",
    "feed-forward sublayer",
    "residual connections",
    "layer normalization",
]

print(components)

Three common transformer families

Family            Main structure                    Best for
Encoder-only      Stack of encoder blocks           Classification, tagging, retrieval-style representations, sentence understanding
Decoder-only      Stack of causal decoder blocks    Autoregressive next-token generation
Encoder-decoder   Encoder stack plus decoder stack  Tasks that generate an output conditioned on an input sequence

Encoder-only:      input -> contextual representations

Decoder-only:     prompt -> next token -> next token -> next token

Encoder-decoder:  source -> encoder states -> decoder -> target sequence

Real-world example: a transformer can connect a pronoun to its likely referent several tokens away, capture the broader sentence context, and then expose that context to a classifier or generator through the final hidden states.

Scope note: this module is about transformer architecture and design logic. Keep detailed attention math in the attention module, pretraining objectives in the BERT/pretraining module, and large-model deployment concerns in the LLM and serving modules.

Advanced

At advanced depth, a transformer is best viewed as a repeated architecture template for representation mixing. Each block combines cross-token interaction, per-token nonlinear transformation, residual addition, and normalization. The architecture became dominant not because one component was magical in isolation, but because the whole stack trained well, scaled well, and generalized across many sequence problems.

Inside a modern transformer block

Input states
   -> normalization
   -> attention-based mixing
   -> residual add
   -> normalization
   -> feed-forward transformation
   -> residual add
   -> output states

transformer_block = {
    "mix_across_tokens": "attention sublayer",
    "transform_each_token": "feed-forward network",
    "stability": ["residual path", "layer normalization"],
    "visibility_rule": "bidirectional or causal mask",
}

print(transformer_block)
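The block diagram above can be written out as a forward pass. This is an illustrative sketch with random weights in place of learned ones and a single attention head; it shows the pre-norm residual structure, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative weights (real models learn these).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def block(x):
    # Communication: attention mixes information across tokens.
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    mixed = softmax(q @ k.T / np.sqrt(d_model)) @ v
    x = x + mixed                        # residual add

    # Computation: feed-forward transforms each position independently.
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2   # ReLU MLP + residual add
    return x

x = rng.normal(size=(seq_len, d_model))
y = block(x)
print(y.shape)  # same shape in, same shape out: blocks stack cleanly
```

Because input and output shapes match, `block(block(x))` is valid, which is exactly why the architecture stacks to arbitrary depth.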

Encoder, decoder, and encoder-decoder as architectural choices

The biggest design decision is not the layer count; it is what information flow the task needs. If every token may look at every other token, an encoder fits; if each token may only look backward, a causal decoder fits; if an output sequence must condition on a separate input sequence, the encoder-decoder pairing fits.

Why the architecture scaled so well

Property                     Why it helped
Parallelizable training      Much more hardware-friendly than pure recurrence over long sequences
Residual block design        Supported deeper stacks and more stable optimization
Architecture reuse           One general template could be adapted to understanding and generation settings
Contextual representations   Token meaning changes with context rather than staying fixed

Training-time and inference-time behavior

Transformers are often discussed as if they behave identically in all settings, but training and inference expose different constraints. During training, a causal model scores every position of a known sequence in one parallel pass; during generation, it must emit tokens one at a time, each new token depending on everything produced so far.
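The contrast can be made concrete with a toy stand-in for a causal model. The "model" here is just a random linear map over embeddings, an assumption for illustration; what matters is that training scores all positions in one call while decoding needs one call per generated token.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10, 8
emb = rng.normal(size=(vocab, d))
out = rng.normal(size=(d, vocab))

def logits_for(seq):
    # Stand-in for a causal transformer: logits at every position at once.
    return emb[np.array(seq)] @ out            # shape (len(seq), vocab)

# Training-style pass: one parallel call scores ALL next-token targets.
tokens = [3, 1, 4, 1, 5]
train_logits = logits_for(tokens)              # (5, vocab) in one shot

# Inference-style decoding: one token per forward pass, sequentially.
seq = [3]
for _ in range(4):
    step_logits = logits_for(seq)[-1]          # only the last position is new
    seq.append(int(step_logits.argmax()))

print(train_logits.shape, len(seq))
```

The sequential loop is the structural reason autoregressive generation is latency-bound even though training parallelizes well.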

Main weaknesses and design limits

Transformers trade recurrence for all-pairs token interaction, and that trade is priced in compute: attention cost grows with the square of the sequence length.

Why long inputs are hard

More tokens
   -> more token-to-token comparisons
   -> more memory and compute
   -> harder batching and slower throughput
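The quadratic growth sketched above is easy to quantify. The memory figures below assume float32 scores for a single attention head; they are illustrative orders of magnitude, not measurements of any particular implementation.

```python
# Attention compares every token with every other token, so the score
# matrix grows quadratically with sequence length.
for n in [128, 512, 2048]:
    comparisons = n * n
    # float32 scores per attention head, in megabytes (illustrative).
    mb = comparisons * 4 / 1e6
    print(f"{n:>5} tokens -> {comparisons:>9,} comparisons (~{mb:.1f} MB/head)")
```

Going from 128 to 2048 tokens multiplies the input by 16 but the comparisons by 256, which is why long-context support is an architectural research area of its own.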

Keep the boundaries clear: do not collapse transformers into attention math, BERT recipes, scaling-law analysis, or LLM application patterns. Those topics build on transformers, but the core topic here is the architecture template itself.

Understanding transformers well means being able to explain what each block contributes, why encoder and decoder variants differ, and why the architecture became a general foundation rather than a one-off translation model.

Study Guide: Transformers, BERT, and GPT

Transformers are still neural networks trained with the same ideas you've learned: feedforward layers, activation functions, loss, and backpropagation. The key difference is how information flows between tokens.

1. Why Transformers Were Introduced

Before transformers, NLP relied on RNNs, LSTMs, and GRUs. These had key problems: tokens had to be processed one step at a time, so training could not parallelize across positions, and information from distant tokens tended to fade over long sequences.

Transformers (introduced in "Attention Is All You Need") solve this by replacing recurrence with attention: words in a sentence directly attend to other words.

Sentence: "The animal didn't cross the street because it was tired."
"it" -> should attend to -> "animal"

2. Transformer Architecture Overview

Each transformer block has two main parts: a self-attention layer and a feedforward network.

Input Tokens
  -> Token Embeddings
  -> Positional Encoding
  -> Self-Attention
  -> Feedforward Network
  -> Output

3. Self-Attention

Self-attention allows every word to look at all other words. Each token creates three vectors: a query (Q), a key (K), and a value (V). Queries are compared against keys to decide how much attention each token pays to each other token, and the resulting weights combine the values:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

Here d is the dimension of the key vectors; dividing by √d keeps the dot products in a range where the softmax stays well-behaved.
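The formula translates directly into a few lines of NumPy. This is a single-head sketch with random Q, K, V matrices in place of learned projections.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]                      # key/query dimension
    scores = Q @ K.T / np.sqrt(d)        # similarity of each query to each key
    weights = softmax(scores)            # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = attention(Q, K, V)
print(out.shape)                         # (4, 8): one mixed vector per token
print(w.sum(axis=-1))                    # every weight row sums to 1
```

Each output row is a weighted average of the value vectors, with the weights chosen by query-key similarity; that averaging step is the "communication" in every transformer block.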

4. Multi-Head Attention

Instead of one attention calculation, transformers run multiple attention heads in parallel. Different heads learn different relationships: grammar, subject-verb agreement, coreference, semantic similarity.
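Mechanically, multi-head attention is a reshape: the model dimension is split into head-sized slices that attend independently and are then merged back. A shape-only sketch, with the per-head attention itself omitted:

```python
import numpy as np

seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads              # 4 dims per head

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))

# Split the model dimension into independent heads...
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(heads.shape)   # (4, 6, 4): each head attends over the sequence alone

# ...run attention per head (omitted), then merge the heads back together.
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(np.allclose(merged, x))            # split + merge is lossless
```

Because each head works in a lower-dimensional subspace, the total cost stays close to single-head attention while the model gains several independent views of the sequence.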

5. Positional Encoding

Transformers have no built-in sense of order, so positional encodings are added to token embeddings:

embedding(word) + positional_encoding(position)
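One classic choice for `positional_encoding` is the fixed sinusoidal scheme from "Attention Is All You Need"; learned position tables are an equally common alternative. A sketch of the sinusoidal version:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Even dims use sine, odd dims use cosine, at geometrically
    # spaced frequencies, so each position gets a unique pattern.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=10, d_model=8)
print(pe.shape)   # (10, 8): one encoding vector per position
print(pe[0])      # position 0: sine terms are 0, cosine terms are 1
```

The encoding has the same width as the token embeddings, so the two can simply be added elementwise before the first block.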

6. Encoder vs Decoder

Type      Used for                                        Example model
Encoder   Understanding (classification, NER, sentiment)  BERT
Decoder   Generation (chatbots, summarization)            GPT
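The real difference between the two is the visibility mask. An encoder lets every token see every other token; a decoder applies a causal (lower-triangular) mask so each position sees only itself and earlier positions:

```python
import numpy as np

seq_len = 5

# Encoder-style visibility: every token can see every other token.
bidirectional = np.ones((seq_len, seq_len), dtype=int)

# Decoder-style visibility: token i sees only positions 0..i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(causal)
# Row i has (i + 1) ones: each position sees itself and everything before.
```

In practice the mask is applied inside attention by setting disallowed scores to a large negative value before the softmax, so the forbidden positions receive zero weight.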

7. BERT (Encoder-Only)

BERT reads text bidirectionally: it sees left and right context simultaneously.

Pretraining tasks:

  • Masked language modeling: random tokens are hidden and the model predicts them from both sides of the gap.
  • Next sentence prediction: the model judges whether one sentence actually follows another.

8. GPT (Decoder-Only)

GPT predicts the next token given all previous tokens; it reads left to right only.

Training: "The cat sat on the" -> predict "mat"
Objective: maximize P(next_token | previous_tokens)
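The objective can be demonstrated on one toy prediction step. The vocabulary and logit values below are made up for illustration; a real model produces logits over tens of thousands of tokens.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits the model produced for "The cat sat on the ___".
vocab = ["mat", "dog", "roof", "table"]
logits = np.array([3.2, 0.1, 1.4, 0.7])

probs = softmax(logits)
prediction = vocab[int(probs.argmax())]
loss = -np.log(probs[vocab.index("mat")])   # cross-entropy on the true token

print(prediction)   # "mat": the highest-probability continuation
```

Maximizing P(next_token | previous_tokens) is exactly minimizing this cross-entropy loss, summed over every position in the training text.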

Common uses: chatbots, text generation, summarization, coding assistants, translation.

9. BERT vs GPT Comparison

Feature              BERT                      GPT
Architecture         Encoder                   Decoder
Reading direction    Bidirectional             Left to right
Main task            Understanding             Generation
Training objective   Masked token prediction   Next-token prediction

To-do list

Learn

  • Understand the full transformer pipeline from token embeddings through stacked blocks to task outputs.
  • Learn the role of attention sublayers, feed-forward layers, residual paths, normalization, and masking.
  • Study the difference between encoder-only, decoder-only, and encoder-decoder architectures.
  • Understand why positional information is required in transformer models.
  • Know why transformers train efficiently yet still face long-sequence cost challenges.

Practice

  • Trace one input sentence through embeddings, positional encoding, one transformer block, and an output head.
  • Explain in your own words what changes when the same block is used in encoder-only versus decoder-only mode.
  • Compare transformer parallel training with recurrent step-by-step processing on the same toy sequence.
  • Read a minimal transformer implementation and label where attention, normalization, residuals, and the MLP appear.
  • Draw the visibility pattern for bidirectional processing versus causal generation.

Build

  • Create a one-page architecture note that explains a transformer block from memory with a clean diagram.
  • Run a tiny transformer example and log tensor shapes after embeddings, one block, and the output layer.
  • Build a comparison sheet showing when to choose encoder-only, decoder-only, or encoder-decoder designs.
  • Write a short technical note describing one long-context limitation and why it arises from the architecture.
  • Implement or inspect a toy block class and map each line to its architectural role.