Beginner
A transformer reads a sequence as a set of token representations that repeatedly exchange information across layers. Instead of carrying one hidden state forward step by step like an RNN, each layer lets a token update itself using other tokens in the sequence plus its own current representation. That makes long-range interactions easier to model and lets training process many positions at once.
Why transformers replaced recurrence
- Direct token interaction: important words can influence each other without passing through a long recurrent chain.
- Parallel training: many sequence positions can be processed simultaneously during training.
- Stable scaling: stacked residual blocks and normalization made it easier to train deeper, larger models.
- Flexible architecture: the same block pattern works for understanding, generation, and conditional generation.
High-level data flow
The model first turns tokens into vectors, injects position information, and then passes those vectors through a stack of transformer blocks. Each block has two broad jobs: mix information across tokens, then transform each token representation independently with a small feed-forward network.
- Embedding layer: maps token IDs into dense vectors.
- Positional signal: gives the model information about order because token mixing alone is permutation-insensitive.
- Attention sublayer: lets each token gather context from other visible tokens.
- Feed-forward sublayer: applies a learned nonlinear transformation to each token position.
- Residual paths: preserve and refine information instead of overwriting it at every layer.
Input text -> tokenization -> token embeddings + positional information -> transformer block 1 -> transformer block 2 -> ... -> final contextual representations -> task head or next-token head
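The first steps of this pipeline can be sketched as a toy NumPy example (a minimal sketch with random stand-in tables; all names and sizes here are hypothetical, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 100, 8, 5

# Toy embedding table and an arbitrary sequence of token IDs.
embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([12, 47, 3, 99, 5])

# Embedding lookup: one d_model-dimensional vector per token.
token_vectors = embedding_table[token_ids]        # shape (5, 8)

# Inject order with a positional signal of the same shape
# (random here; real models use learned or sinusoidal encodings).
positional_signal = rng.normal(size=(seq_len, d_model))
block_input = token_vectors + positional_signal   # shape (5, 8)

print(block_input.shape)  # (5, 8)
```

The resulting `(seq_len, d_model)` matrix is what the first transformer block receives.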
What a transformer block is doing
A block alternates between communication and computation. In the communication step, token representations borrow information from other positions. In the computation step, each position is transformed independently into a richer local representation. Stacking many blocks lets the model build higher-level abstractions layer by layer.
components = [
"token embeddings",
"positional information",
"attention sublayer",
"feed-forward sublayer",
"residual connections",
"layer normalization",
]
print(components)
Three common transformer families
| Family | Main structure | Best for |
|---|---|---|
| Encoder-only | Stack of encoder blocks | Classification, tagging, retrieval-style representations, sentence understanding |
| Decoder-only | Stack of causal decoder blocks | Autoregressive next-token generation |
| Encoder-decoder | Encoder stack plus decoder stack | Tasks that generate an output conditioned on an input sequence |
Encoder-only: input -> contextual representations
Decoder-only: prompt -> next token -> next token -> next token
Encoder-decoder: source -> encoder states -> decoder -> target sequence

Real-world example: a transformer can connect a pronoun to its likely referent several tokens away, detect the broader sentence context, and then expose that context to a classifier or generator through the final hidden states.
Scope note: this module is about transformer architecture and design logic. Keep detailed attention math in the attention module, pretraining objectives in the BERT/pretraining module, and large-model deployment concerns in the LLM and serving modules.
Advanced
At advanced depth, a transformer is best viewed as a repeated architecture template for representation mixing. Each block combines cross-token interaction, per-token nonlinear transformation, residual addition, and normalization. The architecture became dominant not because one component was magical in isolation, but because the whole stack trained well, scaled well, and generalized across many sequence problems.
Inside a modern transformer block
- Attention sublayer: mixes information across visible positions so token representations become context-dependent.
- Feed-forward network: usually a two-layer position-wise MLP that expands and projects representations.
- Residual connections: help optimization by preserving a direct path for information and gradients.
- Layer normalization: stabilizes training and keeps activation scales manageable.
- Masking rules: determine which positions are visible, which is what separates encoder-style bidirectional processing from decoder-style causal generation.
Input states -> normalization -> attention-based mixing -> residual add -> normalization -> feed-forward transformation -> residual add -> output states
transformer_block = {
"mix_across_tokens": "attention sublayer",
"transform_each_token": "feed-forward network",
"stability": ["residual path", "layer normalization"],
"visibility_rule": "bidirectional or causal mask",
}
print(transformer_block)
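The flow above can also be sketched as a runnable toy block (a minimal sketch assuming pre-norm ordering, a single attention head, and random matrices standing in for learned weights; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Random projections stand in for learned weights.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

def block(x):
    # Communication: pre-norm, attention-based mixing, residual add.
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    mixed = softmax(q @ k.T / np.sqrt(d_model)) @ v
    x = x + mixed
    # Computation: pre-norm, position-wise feed-forward, residual add.
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2
    return x

x = rng.normal(size=(seq_len, d_model))
print(block(x).shape)  # (4, 8)
```

Note that the output shape matches the input shape, which is what lets identical blocks be stacked.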
Encoder, decoder, and encoder-decoder as architectural choices
The biggest design decision is not the number of layers; it is the information flow the task needs.
- Encoder-only: every token can usually use both left and right context, which is ideal when the goal is to build strong representations of the whole input.
- Decoder-only: each position can use only earlier tokens, which matches left-to-right generation.
- Encoder-decoder: the decoder generates autoregressively while also consulting encoded source representations, which suits input-to-output mapping.
Why the architecture scaled so well
| Property | Why it helped |
|---|---|
| Parallelizable training | Much more hardware-friendly than pure recurrence over long sequences |
| Residual block design | Supported deeper stacks and more stable optimization |
| Architecture reuse | One general template could be adapted to understanding and generation settings |
| Contextual representations | Token meaning changes with context rather than staying fixed |
Training-time and inference-time behavior
Transformers are often discussed as if they behave identically in all settings, but training and inference expose different constraints.
- Training: many tokens can usually be processed together in batches, which is where the architecture gains much of its efficiency.
- Autoregressive inference: decoder-style generation is still sequential over output positions even though the internal block is parallel within a step.
- Representation tasks: encoder-style models can score or encode an entire input in one forward pass.
Main weaknesses and design limits
- Long-context cost: full token-to-token interaction becomes expensive as sequence length grows.
- Positional dependence: order has to be injected explicitly; it is not built into attention by default.
- Memory pressure: deep stacks and long sequences can make training and serving resource-intensive.
- Architecture is not the training objective: a transformer describes the network structure, not how the model was pretrained or aligned.
Why long inputs are hard
More tokens -> more token-to-token comparisons -> more memory and compute -> harder batching and slower throughput
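The quadratic growth can be made concrete with a back-of-the-envelope calculation (a sketch; the head and layer counts are illustrative, not from any specific model):

```python
# Full attention compares every token with every other token,
# so the score matrix grows as seq_len squared.
for seq_len in [512, 2048, 8192]:
    pairs = seq_len * seq_len
    # One float32 score per pair, per head, per layer
    # (illustrative: 12 heads, 12 layers).
    score_bytes = pairs * 4 * 12 * 12
    print(seq_len, pairs, f"{score_bytes / 2**20:.0f} MiB")
# A 16x longer input needs 256x more score memory.
```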
Keep the boundaries clear: do not collapse transformers into attention math, BERT recipes, scaling-law analysis, or LLM application patterns. Those topics build on transformers, but the core topic here is the architecture template itself.
Understanding transformers well means being able to explain what each block contributes, why encoder and decoder variants differ, and why the architecture became a general foundation rather than a one-off translation model.
Study Guide: Transformers, BERT, and GPT
Transformers are still neural networks trained with the same ideas you've learned: feedforward layers, activation functions, loss, and backpropagation. The key difference is how information flows between tokens.
1. Why Transformers Were Introduced
Before transformers, NLP relied on RNNs, LSTMs, and GRUs. These had key problems:
- Hard to parallelize
- Struggle with long-range dependencies
- Slow training on long sequences
Transformers (introduced in "Attention Is All You Need") solve this by replacing recurrence with attention: words in a sentence directly attend to other words.
Sentence: "The animal didn't cross the street because it was tired."
"it" -> should attend to -> "animal"
2. Transformer Architecture Overview
Each transformer block has two main parts: a self-attention layer and a feedforward network.
Input Tokens -> Token Embeddings -> Positional Encoding -> Self-Attention -> Feedforward Network -> Output
3. Self-Attention
Self-attention allows every word to look at all other words. Each token creates three vectors:
- Query (Q): what am I looking for?
- Key (K): what does each word contain?
- Value (V): the information to pass forward
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
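This formula can be written out directly in NumPy (a toy sketch with random inputs; `d_k` is the key dimension):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Scores: how well each query matches each key, scaled by sqrt(d_k)
    # to keep the softmax from saturating at large dimensions.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1;
    # the weights then blend the value vectors.
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))  # 5 tokens, d_k = 4
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
print(attention(Q, K, V).shape)  # (5, 4)
```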
4. Multi-Head Attention
Instead of one attention calculation, transformers run multiple attention heads in parallel. Different heads learn different relationships: grammar, subject-verb agreement, coreference, semantic similarity.
5. Positional Encoding
Transformers have no built-in sense of order, so positional encodings are added to token embeddings:
embedding(word) + positional_encoding(position)
6. Encoder vs Decoder
| Type | Used for | Example model |
|---|---|---|
| Encoder | Understanding (classification, NER, sentiment) | BERT |
| Decoder | Generation (chatbots, summarization) | GPT |
7. BERT (Encoder-Only)
BERT reads text bidirectionally: it sees left and right context simultaneously.
Pretraining tasks:
- Masked Language Modeling (MLM): some tokens are hidden and the model predicts them.
"The cat sat on the [MASK]" -> "mat"
- Next Sentence Prediction (NSP): the model learns whether sentence B logically follows sentence A.
8. GPT (Decoder-Only)
GPT predicts the next token given all previous tokens: it reads left to right only.
Training: "The cat sat on the" -> predict "mat"
Objective: maximize P(next_token | previous_tokens)
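The left-to-right restriction is enforced by a causal mask inside attention. A toy sketch of the visibility pattern (illustrative only):

```python
import numpy as np

seq_len = 5
# Lower-triangular matrix: row i (the predicting position) may
# attend only to columns j <= i (earlier or current tokens).
# In practice, masked positions get -inf added to their attention
# scores before the softmax, zeroing their weights.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
```

Each later row sees one more token than the row before it, which is exactly the next-token training setup described above.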
Common uses: chatbots, text generation, summarization, coding assistants, translation.
9. BERT vs GPT Comparison
| Feature | BERT | GPT |
|---|---|---|
| Architecture | Encoder | Decoder |
| Reading direction | Bidirectional | Left to right |
| Main task | Understanding | Generation |
| Training objective | Masked token prediction | Next-token prediction |
To-do list
Learn
- Understand the full transformer pipeline from token embeddings through stacked blocks to task outputs.
- Learn the role of attention sublayers, feed-forward layers, residual paths, normalization, and masking.
- Study the difference between encoder-only, decoder-only, and encoder-decoder architectures.
- Understand why positional information is required in transformer models.
- Know why transformers train efficiently yet still face long-sequence cost challenges.
Practice
- Trace one input sentence through embeddings, positional encoding, one transformer block, and an output head.
- Explain in your own words what changes when the same block is used in encoder-only versus decoder-only mode.
- Compare transformer parallel training with recurrent step-by-step processing on the same toy sequence.
- Read a minimal transformer implementation and label where attention, normalization, residuals, and the MLP appear.
- Draw the visibility pattern for bidirectional processing versus causal generation.
Build
- Create a one-page architecture note that explains a transformer block from memory with a clean diagram.
- Run a tiny transformer example and log tensor shapes after embeddings, one block, and the output layer.
- Build a comparison sheet showing when to choose encoder-only, decoder-only, or encoder-decoder designs.
- Write a short technical note describing one long-context limitation and why it arises from the architecture.
- Implement or inspect a toy block class and map each line to its architectural role.