Tag: transformers

Blog Post·2025-06-20·12 min read

Circuits: How Transformers Implement Algorithms

How to identify the minimal subgraph of attention heads and MLP layers that implements a specific behavior — and what we've learned from the indirect object identification circuit in GPT-2.

mechanistic-interpretability circuits activation-patching indirect-object-identification transformers

Blog Post·2025-06-20·8 min read

Logit Lens: How Predictions Form Layer by Layer

Applying the unembedding matrix at intermediate layers to watch how a transformer's prediction evolves — and what direct logit attribution tells us about which components matter.

mechanistic-interpretability logit-lens direct-logit-attribution interpretability transformers

Blog Post·2025-06-20·11 min read

Representation Geometry: How Neural Networks Encode Meaning

The linear representation hypothesis, superposition, polysemanticity, and why transformer activations are more structured than they look.

mechanistic-interpretability representation-learning superposition linear-representation-hypothesis transformers

Blog Post·2025-06-20·9 min read

What Each Transformer Component Actually Does

Attention heads as information-routing circuits, MLP layers as key-value memories, and the residual stream as a shared communication bus.

mechanistic-interpretability transformers attention-heads mlp-layers residual-stream

Blog Post·2024-06-19·3 min read

The Architecture Playground: What a Transformer Config Actually Buys You

An interactive research blog. Drag the config of a decoder-only transformer — hidden size, head counts, FFN type — and watch the parameter count, KV cache, and mixture-of-experts routing recompute live.

architecture transformers interactive kv-cache moe

Blog Post·2024-06-19·6 min read

DiT: Replacing the U-Net with a Transformer

DDPM, DDIM, and latent diffusion all use a U-Net backbone. DiT replaces it with a transformer — and finds that diffusion scales with model size the same way language models do.

diffusion dit transformers generative-models image-generation

Blog Post·2024-06-19·9 min read

Comparing Large Model Architectures: Attention, Normalization, and Scale

GPT-4, Gemini, LLaMA, Mistral, DeepSeek, Qwen — they all build on the same transformer skeleton. But the architectural choices diverge sharply. Here's a systematic comparison across model families.

architecture transformers llm comparison moe

Blog Post·2024-06-19·6 min read

Linear Algebra for LLMs: Vectors, Matrices, and What They Do

Every forward pass is a sequence of matrix multiplications. Understanding what those matrices do — rotate, scale, project — is the foundation for understanding why transformers work.

math linear-algebra transformers

Blog Post·2024-06-19·7 min read

Putting It Together: The Mathematics of a Training Run

A single training step involves linear algebra, probability, information theory, optimization, and statistical estimation — all at once. Here's how the pieces fit.

math llm-training transformers synthesis

Blog Post·2024-06-19·8 min read

How Decoder-Only Transformers Evolved Since GPT-2

GPT-2 established the decoder-only transformer as the dominant paradigm. What followed was six years of systematic improvements — in scale, efficiency, alignment, and reasoning. Here's the arc.

transformers gpt architecture history llm

Blog Post·2024-06-19·6 min read

Inside the FFN: MoE, SwiGLU, and the Architectural Details That Scale

The FFN block holds most of a transformer's parameters. The choices made there (activation function, gating, expert routing) account for much of the quality gap between model families.

transformers moe swiglu architecture llm-training

Blog Post·2024-06-19·6 min read

Attention Variants: MHA, MQA, GQA, and the Memory Math Behind Them

Multi-head attention was the original. Multi-query attention was the efficient approximation. Grouped-query attention is the synthesis that modern LLMs converged on — and the reason is bandwidth, not FLOPs.

transformers attention gqa architecture inference

Blog Post·2024-06-19·4 min read

Data Efficiency in Pretraining: Packing, Batching, and What Gets Wasted

Up to 30% of GPU compute can vanish into padding tokens that contribute nothing to learning. Here's how modern pretraining pipelines eliminate that waste.

transformers llm-training sequence-packing pretraining

Blog Post·2024-06-19·13 min read

Debugging Transformer Training Runs: Reading the Curves

Most training failures leave signatures in the metrics before they fully manifest. Here's how to read loss curves, gradient norms, learning rate schedules, and activation statistics to diagnose what's going wrong.

transformers llm-training debugging training-dynamics

Blog Post·2024-06-19·4 min read

Normalization in Transformers: Why Pre-LN Became the Default

The original transformer used Post-LN. Pre-LN has been the default since GPT-2. The reason comes down to gradient flow, and the math is clean enough to be worth understanding.

transformers normalization llm-training

Blog Post·2024-06-19·4 min read

Optimizers for LLMs: Adam, Weight Decay, and Why Learning Rate Matters More Than You Think

Adam is the default optimizer for language model training, but using it correctly — the right β values, weight decay, learning rate schedule — makes a larger difference than most people expect.

transformers optimizer llm-training adam

Blog Post·2024-06-19·6 min read

Positional Encodings: From Sinusoids to RoPE

Self-attention on its own is permutation-equivariant: reorder the input tokens and the outputs reorder the same way. Positional encodings break that symmetry. The choice of encoding method determines whether your model can generalize to longer sequences than it trained on.

transformers rope positional-encoding architecture