Subject 26

Interpretability in transformers

Interpretability in transformers asks what information is stored inside the model, which internal parts matter for a prediction, and whether we can describe any of the computation in human-understandable terms. It is useful for debugging, trust, model science, and targeted intervention, but it is not the same thing as a complete proof of model reasoning.

Beginner

When a transformer gives an answer, interpretability asks questions such as:

  • What information is stored inside the model?
  • Which internal components mattered for this prediction?
  • Can any part of the computation be described in human-understandable terms?

Common families of interpretability tools

Method family | Main question | Typical caution
Attention visualization | Where did a token attend? | Attention is evidence, not a full explanation.
Attribution or saliency | Which inputs most changed the output? | Gradients can be noisy or unstable.
Probing | What information is decodable from hidden states? | Decodable does not always mean causally used.
Interventions | What happens if we remove or replace a component? | Bad intervention design can create misleading artifacts.

Prompt tokens
   -> embeddings
   -> repeated transformer layers
   -> hidden states, heads, MLP activations
   -> logits
   -> output token

Interpretability asks what can be learned at each arrow or internal state.
tools = ["attention visualization", "attribution", "probing", "intervention"]
print(tools)

Simple example: suppose a transformer classifier labels support emails as urgent. Interpretability can help check whether it is responding to the actual complaint text or merely to superficial signals such as all-caps words, signatures, or a ticket template.
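This kind of shortcut check can be sketched with a simple occlusion test. The snippet below is illustrative only: score_urgent is a hypothetical stand-in for the real classifier, deliberately built to key on all-caps words so the occlusion test has something to find.

```python
import re

# Hypothetical stand-in for the real urgency classifier; it deliberately
# keys on all-caps words so the occlusion test below can expose that.
def score_urgent(text):
    caps = len(re.findall(r"\b[A-Z]{2,}\b", text))
    return 1.0 if caps > 0 else 0.2

email = "URGENT the billing page crashes when I submit a payment"
tokens = email.split()
base = score_urgent(email)

# Occlusion: drop one token at a time and watch the score move.
for i, tok in enumerate(tokens):
    occluded = " ".join(tokens[:i] + tokens[i + 1:])
    drop = base - score_urgent(occluded)
    if drop > 0:
        print(f"removing {tok!r} drops the score by {drop:.2f}")
```

If removing a single superficial token collapses the score while removing the actual complaint words changes nothing, the model is likely leaning on the shortcut rather than the content.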

Scope: this module is about understanding internal model behavior. Keep it separate from attention mechanics, transformer architecture basics, hallucination defense, responsible AI policy, and evaluation metrics. Those belong in other modules.

Intermediate

At intermediate depth, it helps to separate descriptive methods from causal methods.

Descriptive vs causal evidence

Descriptive methods such as attention maps, saliency, and probes observe correlations between internal states and behavior. Causal methods such as ablation and activation patching change the model's internals and measure the effect. Descriptive evidence suggests hypotheses; causal evidence tests them.

Important transformer objects to inspect

  • Hidden states at each layer and token position
  • Attention heads and their weight patterns
  • MLP activations and individual neurons
  • Logits and the final output distribution

Widely used techniques

Technique | What it gives you | Main limitation
Probing classifier | Tests whether information is linearly decodable from activations | Encoded information may not be used by the model
Logit lens | Projects intermediate states into vocabulary space to inspect evolving predictions | Can oversimplify because later layers can still transform the state
Head or neuron ablation | Measures how much behavior degrades when a component is removed | Redundancy can hide importance if many components share the same role
Activation patching | Replaces internal activations from one run into another to localize useful computation | Needs careful task design and clean corrupted examples
Sparse autoencoders | Attempts to recover more interpretable latent features from superposed activations | Recovered features can still be incomplete or hard to validate
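As a concrete illustration of one table row, here is a minimal logit-lens sketch using random stand-in weights. Everything here (vocab, W_U, the hidden states) is synthetic; a real logit lens projects with the model's trained unembedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a tiny vocabulary and a random "unembedding" matrix.
# A real logit lens reuses the model's trained unembedding.
vocab = ["Paris", "Rome", "London", "the", "is"]
W_U = rng.normal(size=(len(vocab), 16))        # vocab x hidden

def logit_lens(hidden_state, k=3):
    # Project an intermediate hidden state into vocabulary space
    # and return the k highest-scoring tokens.
    logits = W_U @ hidden_state
    top = np.argsort(logits)[::-1][:k]
    return [vocab[i] for i in top]

# Pretend these are one position's hidden states at successive layers.
layer_states = [rng.normal(size=16) for _ in range(4)]
for layer, h in enumerate(layer_states):
    print(f"layer {layer}: {logit_lens(h)}")
```

Watching how the top tokens shift layer by layer is the whole trick; the caution from the table still applies, since later layers may rewrite the state before the real unembedding is applied.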

Example: a simple probe

from sklearn.linear_model import LogisticRegression

# X_train, X_valid: hidden states from one transformer layer
# y_train, y_valid: labels such as whether each token is inside a named entity
probe = LogisticRegression(max_iter=2000)
probe.fit(X_train, y_train)

score = probe.score(X_valid, y_valid)
print(f"probe accuracy: {score:.3f}")

A good probe result means the representation contains usable information about the label. It does not by itself prove that the model relies on that information when producing its final answer.
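One common sanity check for this gap, sketched here entirely on synthetic stand-in data, is a shuffled-label control: if the probe scores well on real labels but near chance on shuffled ones, the signal at least lives in the representation rather than in the probe's own capacity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: one direction really encodes y.
n, d = 400, 32
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * y                        # plant the signal in one dimension

def probe_score(X, y):
    probe = LogisticRegression(max_iter=2000)
    probe.fit(X[:300], y[:300])           # train split
    return probe.score(X[300:], y[300:])  # held-out split

real = probe_score(X, y)
control = probe_score(X, rng.permutation(y))  # shuffled-label control
print(f"real labels: {real:.2f}, shuffled labels: {control:.2f}")
```

A large gap between the two scores is reassuring; two similarly high scores suggest the probe is memorizing rather than reading structure.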

Example: activation patching idea

# Pseudocode: model, patch_activation, and read_top_token stand for the
# run-capture and patching helpers in your interpretability tooling.
clean_run = model("The capital of France is")
corrupted_run = model("The capital of Italy is")

patched = patch_activation(
    source=clean_run,
    target=corrupted_run,
    layer=10,
    position=-1,
)

print(read_top_token(patched))

If patching one activation from the clean run into the corrupted run restores the correct prediction, that is strong evidence that the patched location carried important information for the answer.
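The mechanic can be shown end to end in a toy numeric model. This is a stand-in, not a transformer: each layer here depends only on the previous state, so patching one intermediate layer restores the clean output exactly, whereas in a real model restoration is usually partial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in model: three deterministic layers, then an unembedding.
W_U = np.round(rng.normal(size=(5, 8)), 3)     # 5 "tokens", hidden size 8

def forward(h0, patch=None):
    # patch: optional (layer_index, activation) spliced in mid-run
    h = h0
    states = []
    for layer in range(3):
        h = np.tanh(h + 0.5 * layer)           # stand-in layer update
        if patch is not None and patch[0] == layer:
            h = patch[1]
        states.append(h)
    return W_U @ h, states

clean_h0 = rng.normal(size=8)
corrupt_h0 = rng.normal(size=8)

clean_logits, clean_states = forward(clean_h0)
corrupt_logits, _ = forward(corrupt_h0)

# Splice the clean layer-1 activation into the corrupted run.
patched_logits, _ = forward(corrupt_h0, patch=(1, clean_states[1]))
print(int(np.argmax(corrupt_logits)), int(np.argmax(patched_logits)),
      int(np.argmax(clean_logits)))
```

Because the patched state fully determines everything downstream in this toy, the patched run's logits match the clean run's exactly; in a real transformer, attention mixes information across positions, so the degree of restoration is what localizes the computation.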

Attention maps are often over-interpreted. Research has repeatedly shown that attention weights can disagree with other feature-importance measures, and different attention patterns can sometimes yield similar outputs.
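A minimal numeric illustration of the second point: when the attended value vectors are similar, very different attention patterns produce the same output, so the weights alone underdetermine the explanation.

```python
import numpy as np

# Two positions whose value vectors happen to be identical.
v = np.array([[1.0, 2.0],
              [1.0, 2.0]])

attn_a = np.array([0.9, 0.1])   # mostly attends to position 0
attn_b = np.array([0.2, 0.8])   # mostly attends to position 1

out_a = attn_a @ v
out_b = attn_b @ v
print(out_a, out_b)             # same weighted sum either way
```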

Advanced

Advanced interpretability aims for mechanistic understanding: not just observing correlations, but describing how parts of the transformer cooperate to implement a computation. In the strongest version, you can name a circuit, intervene on it, and predict the effect of the intervention.

Mechanistic interpretability

Input prompt
   -> early layers build local and lexical features
   -> middle layers combine and route information
   -> later layers sharpen task-specific predictions
   -> logits

Mechanistic interpretability asks which features and components form the path
from evidence in the prompt to the final logit change.

What strong evidence looks like

  • A named hypothesis about which components implement the behavior
  • A causal intervention, such as ablation or patching, whose effect is predicted in advance
  • Results that generalize beyond a single prompt or template
  • Agreement between descriptive signals and intervention outcomes

Typical failure modes in interpretability work

  • Treating attention weights as a complete explanation
  • Concluding that decodable information is causally used
  • Missing redundancy, so an ablated component looks unimportant when others share its role
  • Misleading artifacts from badly designed interventions or corrupted examples
  • Overgeneralizing from a single prompt

Practical uses for engineers and researchers

  • Debugging unexpected behavior and detecting shortcut use
  • Building calibrated trust in model outputs
  • Model science: understanding what training actually produced
  • Targeted intervention on a specific behavior or component

A compact workflow

questions = [
    "What feature or behavior am I trying to explain?",
    "What descriptive signal suggests a hypothesis?",
    "What causal intervention can test it?",
    "Does the result generalize beyond one prompt?",
]
print(questions)

A useful mental model is: descriptive tools help generate hypotheses, causal tools help test them, and mechanistic claims should survive both.

The field is still incomplete. Some elegant results come from small or narrow settings, while large real transformers remain only partially understood. Strong claims should be calibrated accordingly.

To-do list

Learn

  • Understand the difference between descriptive evidence and causal evidence.
  • Learn what probes, saliency, ablations, logit lens, and activation patching each can and cannot show.
  • Study why attention weights alone are not a complete explanation of model behavior.
  • Learn the ideas of circuits, feature directions, polysemanticity, and superposition at a conceptual level.
  • Understand why interpretability is valuable for debugging but usually does not produce certainty.

Practice

  • Run attention visualization on a short prompt and write down two things it suggests and two things it does not prove.
  • Train a simple probe on hidden states for a token-level property such as POS tag or named entity membership.
  • Ablate one attention head or one neuron group and measure the change on a small controlled task.
  • Design a clean versus corrupted example pair for a simple activation patching experiment.
  • Inspect one model failure and propose whether it looks like shortcut use, missing knowledge, or a broken internal routing pattern.

Build

  • Build a small notebook or script that compares attention visualization, attribution, and ablation on the same prompt.
  • Create a mini dashboard that shows hidden-state probe scores by layer.
  • Implement a toy activation patching workflow for a factual prompt pair.
  • Write a one-page interpretability report that states the hypothesis, evidence, intervention, result, and remaining uncertainty.
  • Document how you would scale the same workflow from a toy example to a real engineering debugging task.