Beginner
A transformer reads a sequence as a set of token representations that repeatedly exchange information across layers. Instead of carrying one hidden state forward step by step like an RNN, each layer lets a token update itself using other tokens in the sequence plus its own current representation. That makes long-range interactions easier to model and lets training process many positions at once.
Why transformers replaced recurrence
- Direct token interaction: important words can influence each other without passing through a long recurrent chain.
- Parallel training: many sequence positions can be processed simultaneously during training.
- Stable scaling: stacked residual blocks and normalization made it easier to train deeper, larger models.
- Flexible architecture: the same block pattern works for understanding, generation, and conditional generation.
High-level data flow
The model first turns tokens into vectors, injects position information, and then passes those vectors through a stack of transformer blocks. Each block has two broad jobs: mix information across tokens, then transform each token representation independently with a small feed-forward network.
- Embedding layer: maps token IDs into dense vectors.
- Positional signal: gives the model information about order because token mixing alone is permutation-insensitive.
- Attention sublayer: lets each token gather context from other visible tokens.
- Feed-forward sublayer: applies a learned nonlinear transformation to each token position.
- Residual paths: preserve and refine information instead of overwriting it at every layer.
Input text -> tokenization -> token embeddings + positional information -> transformer block 1 -> transformer block 2 -> ... -> final contextual representations -> task head or next-token head
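The first steps of this pipeline can be sketched as a toy NumPy example (a minimal sketch with random stand-in tables; all names and sizes here are hypothetical, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 100, 8, 5

# Toy embedding table and an arbitrary sequence of token IDs.
embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([12, 47, 3, 99, 5])

# Embedding lookup: one d_model-dimensional vector per token.
token_vectors = embedding_table[token_ids]        # shape (5, 8)

# Inject order with a positional signal of the same shape
# (random here; real models use learned or sinusoidal encodings).
positional_signal = rng.normal(size=(seq_len, d_model))
block_input = token_vectors + positional_signal   # shape (5, 8)

print(block_input.shape)  # (5, 8)
```

The resulting `(seq_len, d_model)` matrix is what the first transformer block receives.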
What a transformer block is doing
A block alternates between communication and computation. In the communication step, token representations borrow information from other positions. In the computation step, each position is transformed independently into a richer local representation. Stacking many blocks lets the model build higher-level abstractions layer by layer.
components = [
"token embeddings",
"positional information",
"attention sublayer",
"feed-forward sublayer",
"residual connections",
"layer normalization",
]
print(components)
Three common transformer families
| Family | Main structure | Best for |
|---|---|---|
| Encoder-only | Stack of encoder blocks | Classification, tagging, retrieval-style representations, sentence understanding |
| Decoder-only | Stack of causal decoder blocks | Autoregressive next-token generation |
| Encoder-decoder | Encoder stack plus decoder stack | Tasks that generate an output conditioned on an input sequence |
Encoder-only: input -> contextual representations
Decoder-only: prompt -> next token -> next token -> next token
Encoder-decoder: source -> encoder states -> decoder -> target sequence

Real-world example: a transformer can connect a pronoun to its likely referent several tokens away, detect the broader sentence context, and then expose that context to a classifier or generator through the final hidden states.
Scope note: this module is about transformer architecture and design logic. Keep detailed attention math in the attention module, pretraining objectives in the BERT/pretraining module, and large-model deployment concerns in the LLM and serving modules.
Advanced
At advanced depth, a transformer is best viewed as a repeated architecture template for representation mixing. Each block combines cross-token interaction, per-token nonlinear transformation, residual addition, and normalization. The architecture became dominant not because one component was magical in isolation, but because the whole stack trained well, scaled well, and generalized across many sequence problems.
Inside a modern transformer block
- Attention sublayer: mixes information across visible positions so token representations become context-dependent.
- Feed-forward network: usually a two-layer position-wise MLP that expands and projects representations.
- Residual connections: help optimization by preserving a direct path for information and gradients.
- Layer normalization: stabilizes training and keeps activation scales manageable.
- Masking rules: determine which positions are visible, which is what separates encoder-style bidirectional processing from decoder-style causal generation.
Input states -> normalization -> attention-based mixing -> residual add -> normalization -> feed-forward transformation -> residual add -> output states
transformer_block = {
"mix_across_tokens": "attention sublayer",
"transform_each_token": "feed-forward network",
"stability": ["residual path", "layer normalization"],
"visibility_rule": "bidirectional or causal mask",
}
print(transformer_block)
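The flow above can also be sketched as a runnable toy block (a minimal sketch assuming pre-norm ordering, a single attention head, and random matrices standing in for learned weights; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Random projections stand in for learned weights.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

def block(x):
    # Communication: pre-norm, attention-based mixing, residual add.
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    mixed = softmax(q @ k.T / np.sqrt(d_model)) @ v
    x = x + mixed
    # Computation: pre-norm, position-wise feed-forward, residual add.
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2
    return x

x = rng.normal(size=(seq_len, d_model))
print(block(x).shape)  # (4, 8)
```

Note that the output shape matches the input shape, which is what lets identical blocks be stacked.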
Encoder, decoder, and encoder-decoder as architectural choices
The biggest design decision is not the number of layers; it is the information flow the task needs.
- Encoder-only: every token can usually use both left and right context, which is ideal when the goal is to build strong representations of the whole input.
- Decoder-only: each position can use only earlier tokens, which matches left-to-right generation.
- Encoder-decoder: the decoder generates autoregressively while also consulting encoded source representations, which suits input-to-output mapping.
Why the architecture scaled so well
| Property | Why it helped |
|---|---|
| Parallelizable training | Much more hardware-friendly than pure recurrence over long sequences |
| Residual block design | Supported deeper stacks and more stable optimization |
| Architecture reuse | One general template could be adapted to understanding and generation settings |
| Contextual representations | Token meaning changes with context rather than staying fixed |
Training-time and inference-time behavior
Transformers are often discussed as if they behave identically in all settings, but training and inference expose different constraints.
- Training: many tokens can usually be processed together in batches, which is where the architecture gains much of its efficiency.
- Autoregressive inference: decoder-style generation is still sequential over output positions even though the internal block is parallel within a step.
- Representation tasks: encoder-style models can score or encode an entire input in one forward pass.
Main weaknesses and design limits
- Long-context cost: full token-to-token interaction becomes expensive as sequence length grows.
- Positional dependence: order has to be injected explicitly; it is not built into attention by default.
- Memory pressure: deep stacks and long sequences can make training and serving resource-intensive.
- Architecture is not the training objective: a transformer describes the network structure, not how the model was pretrained or aligned.
Why long inputs are hard
More tokens -> more token-to-token comparisons -> more memory and compute -> harder batching and slower throughput
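The quadratic growth can be made concrete with a back-of-the-envelope calculation (a sketch; the head and layer counts are illustrative, not from any specific model):

```python
# Full attention compares every token with every other token,
# so the score matrix grows as seq_len squared.
for seq_len in [512, 2048, 8192]:
    pairs = seq_len * seq_len
    # One float32 score per pair, per head, per layer
    # (illustrative: 12 heads, 12 layers).
    score_bytes = pairs * 4 * 12 * 12
    print(seq_len, pairs, f"{score_bytes / 2**20:.0f} MiB")
# A 16x longer input needs 256x more score memory.
```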
Keep the boundaries clear: do not collapse transformers into attention math, BERT recipes, scaling-law analysis, or LLM application patterns. Those topics build on transformers, but the core topic here is the architecture template itself.
Understanding transformers well means being able to explain what each block contributes, why encoder and decoder variants differ, and why the architecture became a general foundation rather than a one-off translation model.
Study Guide: Transformers, BERT, and GPT
Transformers are still neural networks trained with the same ideas you've learned: feedforward layers, activation functions, loss, and backpropagation. The key difference is how information flows between tokens.
1. Why Transformers Were Introduced
Before transformers, NLP relied on RNNs, LSTMs, and GRUs. These had key problems:
- Hard to parallelize
- Struggle with long-range dependencies
- Slow training on long sequences
Transformers (introduced in "Attention Is All You Need") solve this by replacing recurrence with attention: words in a sentence directly attend to other words.
Sentence: "The animal didn't cross the street because it was tired."
"it" -> should attend to -> "animal"
2. Transformer Architecture Overview
Each transformer block has two main parts: a self-attention layer and a feedforward network.
Input Tokens -> Token Embeddings -> Positional Encoding -> Self-Attention -> Feedforward Network -> Output
3. Self-Attention
Self-attention allows every word to look at all other words. Each token creates three vectors:
- Query (Q): what am I looking for?
- Key (K): what does each word contain?
- Value (V): the information to pass forward
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
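This formula can be written out directly in NumPy (a toy sketch with random inputs; `d_k` is the key dimension):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Scores: how well each query matches each key, scaled by sqrt(d_k)
    # to keep the softmax from saturating at large dimensions.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1;
    # the weights then blend the value vectors.
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))  # 5 tokens, d_k = 4
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
print(attention(Q, K, V).shape)  # (5, 4)
```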
4. Multi-Head Attention
Instead of one attention calculation, transformers run multiple attention heads in parallel. Different heads learn different relationships: grammar, subject-verb agreement, coreference, semantic similarity.
5. Positional Encoding
Transformers have no built-in sense of order, so positional encodings are added to token embeddings:
embedding(word) + positional_encoding(position)
6. Encoder vs Decoder
| Type | Used for | Example model |
|---|---|---|
| Encoder | Understanding (classification, NER, sentiment) | BERT |
| Decoder | Generation (chatbots, summarization) | GPT |
7. BERT (Encoder-Only)
BERT reads text bidirectionally: it sees left and right context simultaneously.
Pretraining tasks:
- Masked Language Modeling (MLM): some tokens are hidden and the model predicts them.
"The cat sat on the [MASK]" -> "mat"
- Next Sentence Prediction (NSP): the model learns whether sentence B logically follows sentence A.
8. GPT (Decoder-Only)
GPT predicts the next token given all previous tokens: it reads left to right only.
Training: "The cat sat on the" -> predict "mat"
Objective: maximize P(next_token | previous_tokens)
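The left-to-right restriction is enforced by a causal mask inside attention. A toy sketch of the visibility pattern (illustrative only):

```python
import numpy as np

seq_len = 5
# Lower-triangular matrix: row i (the predicting position) may
# attend only to columns j <= i (earlier or current tokens).
# In practice, masked positions get -inf added to their attention
# scores before the softmax, zeroing their weights.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
```

Each later row sees one more token than the row before it, which is exactly the next-token training setup described above.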
Common uses: chatbots, text generation, summarization, coding assistants, translation.
9. BERT vs GPT Comparison
| Feature | BERT | GPT |
|---|---|---|
| Architecture | Encoder | Decoder |
| Reading direction | Bidirectional | Left to right |
| Main task | Understanding | Generation |
| Training objective | Masked token prediction | Next-token prediction |
To-do list
Learn
- Understand the full transformer pipeline from token embeddings through stacked blocks to task outputs.
- Learn the role of attention sublayers, feed-forward layers, residual paths, normalization, and masking.
- Study the difference between encoder-only, decoder-only, and encoder-decoder architectures.
- Understand why positional information is required in transformer models.
- Know why transformers train efficiently yet still face long-sequence cost challenges.
Practice
- Trace one input sentence through embeddings, positional encoding, one transformer block, and an output head.
- Explain in your own words what changes when the same block is used in encoder-only versus decoder-only mode.
- Compare transformer parallel training with recurrent step-by-step processing on the same toy sequence.
- Read a minimal transformer implementation and label where attention, normalization, residuals, and the MLP appear.
- Draw the visibility pattern for bidirectional processing versus causal generation.
Build
- Create a one-page architecture note that explains a transformer block from memory with a clean diagram.
- Run a tiny transformer example and log tensor shapes after embeddings, one block, and the output layer.
- Build a comparison sheet showing when to choose encoder-only, decoder-only, or encoder-decoder designs.
- Write a short technical note describing one long-context limitation and why it arises from the architecture.
- Implement or inspect a toy block class and map each line to its architectural role.