Blog Post · 8 min read
Logit Lens: How Predictions Form Layer by Layer
Applying the unembedding matrix at intermediate layers to watch how a transformer's prediction evolves — and what direct logit attribution tells us about which components matter.
Faithfulness vs. plausibility, scaling to frontier models, the composition problem, automated interpretability, and what it would take to actually understand a large language model.