Tag: interpretability

Blog Post · 2025-06-20 · 8 min read

Logit Lens: How Predictions Form Layer by Layer

Applying the unembedding matrix at intermediate layers to watch how a transformer's prediction evolves, and what direct logit attribution tells us about which components matter.

mechanistic-interpretability logit-lens direct-logit-attribution interpretability transformers

Blog Post · 2025-06-20 · 12 min read

Open Problems in Mechanistic Interpretability

Faithfulness vs. plausibility, scaling to frontier models, the composition problem, automated interpretability, and what it would take to actually understand a large language model.

mechanistic-interpretability open-problems ai-safety interpretability