Blog Post
The Architecture Playground: What a Transformer Config Actually Buys You
An interactive research blog. Drag the config of a decoder-only transformer — hidden size, head counts, FFN type — and watch the parameter count, KV cache, and mixture-of-experts routing recompute live.
A frontier language model, stripped of its press release, is a short list of integers. Hidden size, number of layers, how many attention heads, how many of those heads share key/value projections, the feed-forward width, the vocabulary, the context length. Those numbers decide the parameter count, the memory each token of context costs at inference, and whether the model is dense or a mixture of experts.
The trouble with reading them off a table is that the consequences are non-linear and they interact. Halving the key/value heads barely touches the parameter count but halves the KV cache. Switching a dense feed-forward block for a mixture of experts can multiply the total parameters several times over while the compute per token grows far less, because only a few experts run for each token. The only way to build intuition is to turn the dials and watch.
So turn them. The playground below is live — every slider recomputes the parameter breakdown, the spatial diagram, the KV-cache memory bar, and the expert-routing view in real time. Load a preset to start from a real published config.
config.json
Parameter summary
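The arithmetic behind a parameter summary like this one can be sketched in a few lines of Python. This is a minimal sketch, not the playground's own code: the `Config` class and `param_breakdown` function are hypothetical names, the field names other than `num_attention_heads` and `num_key_value_heads` are assumed Hugging Face-style keys, and the defaults are an assumed LLaMA-3-8B-like preset with bias-free projections and an untied LM head.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Assumed LLaMA-3-8B-like defaults; names are illustrative.
    hidden_size: int = 4096
    num_hidden_layers: int = 32
    num_attention_heads: int = 32
    num_key_value_heads: int = 8
    intermediate_size: int = 14336
    vocab_size: int = 128256
    tie_word_embeddings: bool = False


def param_breakdown(cfg: Config) -> dict:
    """Count weights for a dense decoder-only transformer (no biases assumed)."""
    d_head = cfg.hidden_size // cfg.num_attention_heads
    kv_width = cfg.num_key_value_heads * d_head

    # Q and O are H x H; K and V are H x kv_width (narrower under GQA).
    attn_per_layer = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_width
    # SwiGLU uses three matrices: gate, up, down.
    ffn_per_layer = 3 * cfg.hidden_size * cfg.intermediate_size
    # Two RMSNorm gain vectors per layer, plus one final norm (assumed).
    norms = 2 * cfg.hidden_size * cfg.num_hidden_layers + cfg.hidden_size

    embedding = cfg.vocab_size * cfg.hidden_size
    lm_head = 0 if cfg.tie_word_embeddings else embedding

    breakdown = {
        "embedding": embedding,
        "attention": attn_per_layer * cfg.num_hidden_layers,
        "feed_forward": ffn_per_layer * cfg.num_hidden_layers,
        "norms": norms,
        "lm_head": lm_head,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    for name, count in param_breakdown(Config()).items():
        print(f"{name:>13}: {count / 1e9:7.3f} B")
```

Under these assumptions the feed-forward blocks account for most of the weights, which is why the feed-forward width and the FFN type move the total far more than the head counts do.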
⚡ kernel: Flash Attention — same math, IO-aware tiling, O(N) memory vs O(N²)
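The "same math" claim rests on the online-softmax trick: a softmax-weighted sum can be accumulated block by block while carrying only a running maximum and a running normaliser. The sketch below illustrates that idea in plain Python for a single query row. It is not the Flash Attention kernel itself, which also tiles the queries and is organised around GPU on-chip memory; the function names and the block size are hypothetical.

```python
import math


def attention_row_full(q, keys, values):
    """Reference: materialise every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def attention_row_tiled(q, keys, values, block=2):
    """Same result, visiting keys in blocks and never storing all scores."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])

    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Because only one block of scores exists at a time, the extra memory for a row stays constant in the sequence length, which is where the O(N) versus O(N²) difference comes from.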
The forward pass — a live derivation
The whole computation, written out and specialised to the exact config above. This is the heart of the playground: every shape, branch, and substituted value rewrites itself as you drag a slider or load a preset.
1. Input embedding
   S = 8,192 tokens, H = 4096 → x₀ has shape 8,192 × 4096

2. Positional encoding · RoPE
   Applied to Q and K inside attention, not to x₀. d_h = 128

Repeated for each layer ℓ = 1 … 32:

3. Residual block · pre-norm, RMSNorm (× 32 layers)
   3a · pre-norm residual: normalize inside each residual branch; stable gradients, the modern default
   3b · RMSNorm definition: drops mean-centering and bias; cheaper, ~same quality; H = 4096

4. Attention · Grouped-Query (GQA) · Flash (× 32 layers)
   4a · query projection: 32 heads × d_h 128 = H 4096
   4b · key / value projection · GQA: n_kv = 8 heads, each shared by a group of 4 query heads → K/V width 8 × 128 = 1024
   4c · RoPE on Q, K: rotate Q and K by absolute position before the dot product, so the scores depend on relative offset
   4d · scaled dot-product: d_h = H / n_heads = 4096/32 = 128
   4e · output projection
   4f · kernel: same math; IO-aware tiling computes softmax in blocks, so the N×N score matrix is never materialised: O(N) memory instead of O(N²)

5. Feed-forward · dense SwiGLU (× 32 layers)
   SwiGLU: SiLU/Swish-gated, ⊙ is elementwise; 3 matrices; H = 4096, F = 14336

End of layer loop.

6. Output / LM head
   V = 128256 vocab; head 4096 × 128256 = 525.3 M params (often tied to the embedding); next-token distribution over the vocabulary
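For reference, the steps above correspond to the following standard formulation, written out for the shapes in this config. This is a sketch under assumptions: the weight names (W_Q, W_gate and so on), the gain vector g, the causal mask M, the 0-indexed head-to-group map g(i), and the final RMSNorm before the LM head are conventions commonly used in LLaMA-style models rather than something this page states.

```latex
\begin{aligned}
\mathrm{RMSNorm}(x) &= \frac{x}{\sqrt{\tfrac{1}{H}\sum_{i=1}^{H} x_i^{2} + \epsilon}} \odot g,
  \qquad H = 4096 \\[4pt]
h_\ell &= x_{\ell-1} + \mathrm{Attn}\big(\mathrm{RMSNorm}(x_{\ell-1})\big) \\
x_\ell &= h_\ell + \mathrm{FFN}\big(\mathrm{RMSNorm}(h_\ell)\big),
  \qquad \ell = 1,\dots,32 \\[4pt]
Q &= u W_Q,\quad K = u W_K,\quad V = u W_V,
  \qquad W_Q \in \mathbb{R}^{4096 \times 4096},\;
         W_K, W_V \in \mathbb{R}^{4096 \times 1024} \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(
    \frac{\tilde{Q}_i \tilde{K}_{g(i)}^{\top}}{\sqrt{d_h}} + M
  \right) V_{g(i)},
  \qquad g(i) = \lfloor i / 4 \rfloor,\; d_h = 128 \\
\mathrm{Attn}(u) &= \mathrm{Concat}(\mathrm{head}_0, \dots, \mathrm{head}_{31})\, W_O,
  \qquad W_O \in \mathbb{R}^{4096 \times 4096} \\[4pt]
\mathrm{FFN}(u) &= \big(\mathrm{SiLU}(u W_{\mathrm{gate}}) \odot u W_{\mathrm{up}}\big) W_{\mathrm{down}},
  \qquad W_{\mathrm{gate}}, W_{\mathrm{up}} \in \mathbb{R}^{4096 \times 14336},\;
         W_{\mathrm{down}} \in \mathbb{R}^{14336 \times 4096} \\[4pt]
p(\text{next token}) &= \mathrm{softmax}\big(\mathrm{RMSNorm}(x_{32})\, W_{\mathrm{out}}\big),
  \qquad W_{\mathrm{out}} \in \mathbb{R}^{4096 \times 128256}
\end{aligned}
```

Here u is the normalized input to the block, and the tilde marks Q and K after the RoPE rotation.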
Spatial architecture
The decoder as a stack of layer slabs — token embeddings flow up through attention (purple) and FFN (teal) blocks, N layers deep.
KV cache at inference
Every generated token writes K and V for all layers. This is what actually fills your VRAM during decoding — and why GQA and low-precision caches matter.
Mixture-of-Experts routing
Switch ffn_type to MoE (or pick a Mixtral / DeepSeek preset) to see the router select experts.
This config uses a dense SwiGLU feed-forward block — every token passes through the same up / gate / down weights.
What to look for
Grouped-query attention. Set num_key_value_heads below num_attention_heads and the K and V projection matrices visibly narrow in the per-layer view. The parameter savings are modest; the KV-cache savings are the whole point. LLaMA 3 8B runs 32 query heads over 8 key/value heads — a 4× smaller cache for nearly free.
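The sharing pattern and the size of the saving can be checked with a small sketch. The helper names here are hypothetical, and the consecutive-heads grouping (query heads 0 to 3 read KV head 0, and so on) is the common convention, assumed rather than taken from this page.

```python
def kv_head_for_query(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the KV head it reads, assuming consecutive grouping."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads
    return q_head // group_size


def kv_projection_params(hidden: int, n_heads: int, n_kv_heads: int) -> int:
    """Weights in the K and V projections of one layer (no biases assumed)."""
    d_head = hidden // n_heads
    return 2 * hidden * n_kv_heads * d_head


if __name__ == "__main__":
    mha = kv_projection_params(4096, 32, 32)  # every query head has its own K/V
    gqa = kv_projection_params(4096, 32, 8)   # 8 KV heads shared by 32 query heads
    print("K/V params saved over 32 layers:", 32 * (mha - gqa))
    print("KV-cache shrink factor:", 32 // 8)
```

With these numbers the K and V projections lose about 0.8 billion weights across 32 layers, a modest share of the model, while the cache per token shrinks by the full factor of 4.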
The KV cache is the inference bottleneck, not the weights. The memory a serving stack fights over during decoding is, per sequence, seq_len × n_kv_heads × d_head × 2 × bytes × n_layers, where the 2 covers K plus V and bytes is the size of one cached element. It grows linearly with context. Push the sequence-length slider toward a long context and watch a 7B model's cache blow past 24 GB even though its weights fit comfortably; then drop the precision to fp8 or int4 and watch it come back.
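The cache formula translates directly into code. This is a minimal sketch with hypothetical function names; the `batch` parameter and the example shapes (a LLaMA-3-8B-like GQA config and a 7B-class config with 32 KV heads) are assumptions added for illustration.

```python
def kv_cache_bytes(seq_len: int, n_kv_heads: int, d_head: int, n_layers: int,
                   bytes_per_elem: float = 2.0, batch: int = 1) -> int:
    """seq_len x n_kv_heads x d_head x 2 (K and V) x bytes x n_layers, per sequence."""
    return int(batch * seq_len * n_kv_heads * d_head * 2 * bytes_per_elem * n_layers)


def max_tokens_for_budget(budget_bytes: int, n_kv_heads: int, d_head: int,
                          n_layers: int, bytes_per_elem: float = 2.0) -> int:
    """How many cached tokens fit in a given memory budget."""
    per_token = kv_cache_bytes(1, n_kv_heads, d_head, n_layers, bytes_per_elem)
    return budget_bytes // per_token


if __name__ == "__main__":
    GIB = 2 ** 30
    # Assumed GQA shape: 8 KV heads, d_head 128, 32 layers, fp16 cache.
    print(kv_cache_bytes(8192, 8, 128, 32) / GIB, "GiB at 8,192 tokens (GQA, fp16)")
    # Assumed multi-head shape: 32 KV heads, same depth and head size.
    print(kv_cache_bytes(8192, 32, 128, 32) / GIB, "GiB at 8,192 tokens (MHA, fp16)")
    # int4 is modelled as half a byte per element.
    print(kv_cache_bytes(8192, 32, 128, 32, bytes_per_elem=0.5) / GIB, "GiB (MHA, int4)")
    print(max_tokens_for_budget(24 * GIB, 32, 128, 32), "tokens fit in 24 GiB (MHA, fp16)")
```

Under the assumed multi-head shape, a 24 GiB budget is exhausted at roughly 49 thousand tokens in fp16, so a long-context slider setting gets there quickly; halving the element size doubles that limit.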
Mixture of experts decouples size from cost. Switch the FFN type to MoE, or load Mixtral or DeepSeek-V3. Total parameters explode because every expert is a full feed-forward block, but only num_experts_per_tok of them fire for any given token — so the active parameter count, and the FLOPs, stay small. The routing diagram shows the softmax over experts with the chosen top-k lit up.
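Both halves of that claim, the routing and the total-versus-active count, fit in a short sketch. The function names are hypothetical. Renormalising the chosen weights so they sum to 1 is one common choice and an assumption here, as is the Mixtral-like shape used in the example (hidden size 4096, feed-forward width 14336, 8 experts, 2 active per token).

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Softmax over experts, keep the top k, renormalise their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_ffn_params(hidden, ffn, num_experts, experts_per_tok):
    """Per-layer feed-forward weights: (total stored, active per token)."""
    expert = 3 * hidden * ffn       # each expert is a SwiGLU block: gate, up, down
    router = hidden * num_experts   # one linear layer producing the expert logits
    total = num_experts * expert + router
    active = experts_per_tok * expert + router
    return total, active


if __name__ == "__main__":
    # Example router logits for one token over 4 experts (made-up values).
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
    total, active = moe_ffn_params(4096, 14336, num_experts=8, experts_per_tok=2)
    print(f"per layer: {total / 1e6:.1f} M stored, {active / 1e6:.1f} M active")
```

In this example the layer stores eight experts' worth of weights but multiplies through only two of them per token, so stored parameters are about four times the active ones.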
This is the first entry in an ongoing experiment: research blogs you can operate, not just read. Everything here is computed from the same arithmetic the model configs imply — no approximations hidden behind the diagram.