Circuits: How Transformers Implement Algorithms
How to identify the minimal subgraph of attention heads and MLP layers that implements a specific behavior — and what we've learned from the indirect object identification circuit in GPT-2.
Applying the unembedding matrix at intermediate layers to watch how a transformer's prediction evolves — and what direct logit attribution tells us about which components matter.
The linear representation hypothesis, superposition, polysemanticity, and why transformer activations are more structured than they look.
Attention heads as information-routing circuits, MLP layers as key-value memories, and the residual stream as a shared communication bus.
An interactive research blog. Drag the config of a decoder-only transformer — hidden size, head counts, FFN type — and watch the parameter count, KV cache, and mixture-of-experts routing recompute live.
DDPM, DDIM, and latent diffusion all use a U-Net backbone. DiT replaces it with a transformer — and finds that diffusion scales with model size the same way language models do.
GPT-4, Gemini, LLaMA, Mistral, DeepSeek, Qwen — they all build on the same transformer skeleton. But the architectural choices diverge sharply. Here's a systematic comparison across model families.
Every forward pass is a sequence of matrix multiplications. Understanding what those matrices do — rotate, scale, project — is the foundation for understanding why transformers work.
A single training step involves linear algebra, probability, information theory, optimization, and statistical estimation — all at once. Here's how the pieces fit.
GPT-2 established the decoder-only transformer as the dominant paradigm. What followed was six years of systematic improvements — in scale, efficiency, alignment, and reasoning. Here's the arc.
The FFN block consumes most of a transformer's parameters. The choices made there — activation function, gating, expert routing — account for much of the quality gap between model families.
Multi-head attention was the original. Multi-query attention was the efficient approximation. Grouped-query attention is the synthesis that modern LLMs converged on — and the reason is bandwidth, not FLOPs.
Up to 30% of GPU compute can vanish into padding tokens that contribute nothing to learning. Here's how modern pretraining pipelines eliminate that waste.
Most training failures leave signatures in the metrics before they fully manifest. Here's how to read loss curves, gradient norms, learning rate schedules, and activation statistics to diagnose what's going wrong.
The original transformer used Post-LN. Nearly every major model since GPT-2 uses Pre-LN. The reason comes down to gradient flow, and the math is clean enough to be worth understanding.
Adam is the default optimizer for language model training, but using it correctly — the right β values, weight decay, learning rate schedule — makes a larger difference than most people expect.
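Without positional information, self-attention cannot see token order: permute the inputs and the outputs simply permute with them. Positional encodings break that symmetry. The choice of encoding method determines whether your model can generalize to sequences longer than those it trained on.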