Subject 23

Attention mechanisms

Attention lets a model decide which tokens matter most for the current computation instead of compressing everything into one fixed summary. It started as a fix for Seq2Seq bottlenecks and became one of the most important ideas in modern NLP.

Core idea

When a model processes language, not every earlier word is equally useful. Attention gives the model a way to assign larger weight to the most relevant tokens at each step, instead of treating all context the same.

Why attention was introduced

Early encoder-decoder systems squeezed an entire input sentence into one fixed vector before decoding. That created an information bottleneck, especially for long inputs. Attention fixed this by letting the decoder look back at all source representations and focus on the parts that matter for the current output token.

Query, key, value

Think of attention as a learned lookup process. Every token produces three things:

  • a query: what this token is looking for
  • a key: what this token can be matched against
  • a value: the content it passes along if selected

The model compares the query against all keys to get relevance scores, converts those scores into weights using softmax, then takes a weighted sum of the values. Higher score means more of that token's value flows through.

Current token
  |
  v
compare query against all keys → relevance scores
  |
  v
softmax → weights that sum to 1
  |
  v
weighted sum of values → context-aware representation

Attention is soft: the model distributes probability mass across many tokens rather than picking one with a hard yes/no rule.
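The lookup above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's API; the function name and toy shapes are chosen for the example:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compare queries against keys, softmax the scores, mix the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

# Toy setup: 3 tokens with 4-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # soft attention: every row sums to 1
```

Note the soft behavior: every token receives some probability mass, just more or less of it, rather than a hard yes/no selection.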

Types of attention and where they appear

Self-attention
  What attends to what: a sequence attends to itself; every token looks at every other token in the same sequence.
  Where used: transformer encoder layers (e.g. BERT). Each word learns how it relates to every other word in the sentence.

Cross-attention
  What attends to what: one sequence attends to a different sequence; queries come from the target, keys and values come from the source.
  Where used: encoder-decoder models such as T5 and classic machine translation. The decoder looks at all encoder outputs to decide what source content to use next.

Causal (masked) attention
  What attends to what: each position can only see itself and earlier positions; future tokens are blocked.
  Where used: autoregressive language models like GPT. Ensures the model can only use past context when predicting the next token.

Multi-head attention

A single attention pattern is often too limited to capture everything useful. Multi-head attention runs several attention operations in parallel, each in its own subspace. Different heads specialize in different relationships: one might track local syntax, another long-range dependencies, another coreference. Their outputs are concatenated and projected back to the model dimension.
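The split-attend-concatenate-project flow can be sketched as follows. This is a simplified illustration that slices one set of projections into per-head subspaces; the weight names are illustrative, not from any framework:

```python
import numpy as np

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    """Run attention in parallel subspaces, then concat and project back."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)    # this head's subspace
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        scores = q @ k.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)          # softmax per head
        heads.append(w @ v)
    # Concatenate head outputs and project back to the model dimension.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
n, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o)
print(out.shape)  # same shape as the input: (5, 8)
```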

Advanced

At advanced depth, attention is best understood as a family of mechanisms with different scoring functions, efficiency trade-offs, and failure modes. The central questions are how relevance is scored, what information flow is permitted, and what cost is paid as context grows.

Additive vs dot-product attention

Additive attention
  Idea: a small learned feed-forward scorer over the query and key.
  Note: associated with early encoder-decoder attention such as Bahdanau. More flexible but slower.

Dot-product attention
  Idea: vector dot products for similarity, scaled by the square root of the key dimension to keep scores stable.
  Note: efficient on modern hardware and the standard in transformers.
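The two scoring functions can be contrasted on a single query-key pair. W1, W2, and v below stand in for learned parameters and are randomly initialized purely for illustration:

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
q = rng.normal(size=d)   # one query vector
k = rng.normal(size=d)   # one key vector

# Scaled dot-product score: similarity divided by sqrt(d_k).
dot_score = q @ k / np.sqrt(d)

# Additive (Bahdanau-style) score: a small feed-forward scorer.
# W1, W2, v are placeholders for learned weights.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
additive_score = v @ np.tanh(W1 @ q + W2 @ k)

print(dot_score, additive_score)  # both are single relevance scores
```

Both produce one scalar per query-key pair; the difference is that the additive form applies extra learned parameters before scoring, while the dot product relies on the geometry of the vectors themselves.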

Masking controls visibility

Q × K^T → raw scores
raw scores + mask → blocked positions get large negative values
softmax → normalized weights
weights × V → output
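The four steps above can be sketched with a causal mask in NumPy; the helper name is illustrative:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scores -> mask future positions -> softmax -> weighted sum."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # Q x K^T -> raw scores
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -1e9, scores)           # blocked positions get large negatives
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax -> normalized weights
    return w @ V, w                                   # weights x V -> output

rng = np.random.default_rng(3)
Q = K = V = rng.normal(size=(4, 4))
out, w = causal_attention(Q, K, V)
print(np.triu(w, k=1))  # all (approximately) zero: no weight on future tokens
```

After the softmax, the large negative scores become weights of essentially zero, so each position mixes only itself and earlier positions.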

Cost on long sequences

Full self-attention compares every token against every other token, so the score matrix has n × n entries. Memory and compute grow quadratically with sequence length: doubling the sequence roughly quadruples the cost. This is why long-context NLP is hard.
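A quick sanity check of the quadratic growth, counting score-matrix entries for two sequence lengths:

```python
def score_entries(n):
    """Number of pairwise comparisons in full self-attention over n tokens."""
    return n * n

# Doubling the sequence length quadruples the n x n score matrix.
print(score_entries(1024))  # 1048576
print(score_entries(2048))  # 4194304
print(score_entries(2048) / score_entries(1024))  # 4.0
```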

Efficiency approaches

Common remedies restrict or restructure the computation: sparse attention limits which token pairs are compared, multi-query (MQA) and grouped-query (GQA) attention share key and value projections across heads to shrink the KV cache, and FlashAttention computes exact attention with a memory-efficient kernel.

What attention weights do and do not tell you

Attention maps can be informative, but they are not a complete explanation of why a model made a decision. A token may receive high weight yet not be the only cause of the final output. Residual connections, MLP layers, and layer stacking all affect the result. Attention is useful evidence, not a full causal explanation.

Common failure modes

Typical issues include heads that collapse most of their weight onto a few positions (such as the first token or separator tokens), near-uniform weights that carry little signal, and weakened focus as contexts grow very long.

A good mental model: attention is a content-addressable routing mechanism. It decides which stored representations should influence the current token and by how much.

To-do list

Learn

  • Understand queries, keys, values, scores, and normalized attention weights.
  • Learn why attention solved the fixed-vector bottleneck in early Seq2Seq models.
  • Study self-attention, cross-attention, and causal attention as distinct information-flow patterns.
  • Understand why scaling stabilizes dot-product attention.
  • Learn why full attention becomes expensive as sequence length grows.

Practice

  • Work through a tiny attention example by hand and compute the softmax weights.
  • Draw a toy attention matrix and mark which entries are blocked by a causal mask.
  • Explain in your own words how multi-head attention differs from single-head attention.
  • Compare additive attention and dot-product attention in one paragraph.
  • Estimate how attention cost changes when sequence length doubles.

Build

  • Create a small script that computes scaled dot-product attention for toy vectors.
  • Build a visualization that shows how a causal mask changes the attention matrix.
  • Write a short note explaining when self-attention and cross-attention are used.
  • Summarize one long-context efficiency idea such as sparse attention, MQA, GQA, or FlashAttention.
  • Make a one-page cheat sheet of the attention formula, shapes, and mask types.