The Architecture Playground: What a Transformer Config Actually Buys You

Swastik Roy

Blog Post


An interactive research blog. Drag the config of a decoder-only transformer — hidden size, head counts, FFN type — and watch the parameter count, KV cache, and mixture-of-experts routing recompute live.

June 19, 2024 · 3 min read

architecture transformers interactive kv-cache moe

A frontier language model, stripped of its press release, is a short list of integers. Hidden size, number of layers, how many attention heads, how many of those heads share key/value projections, the feed-forward width, the vocabulary, the context length. Those numbers decide the parameter count, the memory each token of context costs at inference, and whether the model is dense or a mixture of experts.

The trouble with reading them off a table is that the consequences are non-linear and they interact. Halving the key/value heads barely touches the parameter count but halves the KV cache. Switching a dense feed-forward block for a mixture of experts can multiply the total parameters many times over while the compute per token rises far less. The only way to build intuition is to turn the dials and watch.

So turn them. The playground below is live — every slider recomputes the parameter breakdown, the spatial diagram, the KV-cache memory bar, and the expert-routing view in real time. Load a preset to start from a real published config.

config.json

hidden_size: 4,096
num_hidden_layers: 32
num_attention_heads: 32

attn_type: Grouped-Query (GQA) · kernel: Flash Attention
num_key_value_heads: 8 (GQA: grouped K/V heads)
intermediate_size: 14,336
vocab_size: 128,256
max_position_embeddings: 8,192

ffn_type (structural)

hidden_act (activation)

gated → adds a gate matrix (3 total)

pos_encoding

norm_type

norm inside each residual branch (modern)

Parameter summary

Total params: 7.50 B (32 layers)
Active params: 7.50 B (dense, all active)
d_head: 128
Attention: GQA (8/32 kv)
KV / token: 2,048 elements per layer (2·n_kv·d_h)
FFN: SwiGLU (3 matrices)

⚡ kernel: Flash Attention — same math, IO-aware tiling, O(N) memory vs O(N²)
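The 7.50 B total can be reproduced by hand. The sketch below is one plausible counting rule, not the playground's own code: it assumes no bias terms, two RMSNorm gain vectors per layer plus a final one, and an LM head tied to the embedding. The tying is an assumption chosen because it is what makes the arithmetic land on 7.50 B; the function name `count_params` and the `tie_embeddings` flag are illustrative.

```python
def count_params(hidden_size, num_layers, num_heads, num_kv_heads,
                 intermediate_size, vocab_size, tie_embeddings=True):
    """Count weights in a bias-free decoder-only transformer with a gated FFN."""
    d_head = hidden_size // num_heads
    kv_width = num_kv_heads * d_head

    attn = 2 * hidden_size * hidden_size        # W_Q and W_O
    attn += 2 * hidden_size * kv_width          # W_K and W_V (narrower under GQA)
    ffn = 3 * hidden_size * intermediate_size   # gate, up, down
    norms = 2 * hidden_size                     # two RMSNorm gains per layer
    per_layer = attn + ffn + norms

    embed = vocab_size * hidden_size
    # Assumption: a tied LM head reuses the embedding matrix and adds nothing.
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size

    total = num_layers * per_layer + embed + lm_head + final_norm
    return {"attn_per_layer": attn, "ffn_per_layer": ffn,
            "embed": embed, "lm_head": lm_head, "total": total}


params = count_params(4096, 32, 32, 8, 14336, 128256)
untied = count_params(4096, 32, 32, 8, 14336, 128256, tie_embeddings=False)
print(f"tied:   {params['total'] / 1e9:.2f} B")   # 7.50 B
print(f"untied: {untied['total'] / 1e9:.2f} B")   # 8.03 B
```

The feed-forward block accounts for most of each layer (about 176 M of 218 M weights), which is why the FFN type is the dial that moves the total the most.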

The forward pass — a live derivation

The whole computation, written out and specialised to the exact config above. This is the heart of the playground: every shape, branch, and substituted value updates as you drag a slider or load a preset.

1 · Input embedding
$x_0 = \mathrm{Embed}(t) \in \mathbb{R}^{S \times H}$
S = 8,192 tokens, H = 4096 → x₀ has shape 8,192 × 4096
2 · Positional encoding · RoPE
$\mathrm{RoPE}(x, pos) = x \cdot e^{\,i\, pos\, \theta}, \quad \theta_j = 10000^{-2j/d_h}$
applied to Q and K inside attention — not to x₀. d_h = 128
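The complex-exponential form above amounts to rotating pairs of coordinates by an angle proportional to the position. Here is a minimal sketch of that rotation in plain Python. It pairs adjacent coordinates (2j, 2j+1), which is an assumption: implementations differ in how they pair dimensions. The helper names `rope` and `dot` are illustrative.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2j], vec[2j+1]) by the angle pos * theta_j."""
    d = len(vec)  # plays the role of d_h
    out = []
    for j in range(d // 2):
        theta = base ** (-2 * j / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * j], vec[2 * j + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)  # True
```

Both scores agree because rotating the query by one angle and the key by another leaves a dot product that depends only on the difference of the two angles.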
repeated for each layer ℓ = 1 … 32
3 · Residual block · pre-norm, RMSNorm (× 32 layers)
3a · pre-norm residual
$\begin{aligned} h &= x + \mathrm{Attn}\big(\mathrm{RMSNorm}(x)\big) \\ x' &= h + \mathrm{FFN}\big(\mathrm{RMSNorm}(h)\big) \end{aligned}$
normalize inside each residual branch — stable gradients, the modern default
3b · RMSNorm definition
$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{H}\sum_i x_i^2 + \epsilon}} \cdot \gamma$
drops mean-centering and bias — cheaper, ~same quality; H = 4096
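To make 3b concrete, here is a small sketch of RMSNorm on a single vector in plain Python. The value `eps=1e-6` is an assumed default, not one stated in the config, and `rms_norm` is an illustrative name.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Scale x to unit root-mean-square, then apply the learned gain gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gamma)]


x = [3.0, -4.0, 0.0, 0.0]
y = rms_norm(x, gamma=[1.0] * 4)
# mean of squares is 25 / 4 = 6.25, so every entry is divided by about 2.5
print([round(v, 3) for v in y])
```

Unlike LayerNorm there is no mean subtraction and no bias: the only learned parameters are the H entries of γ.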
4 · Attention · Grouped-Query (GQA) · Flash (× 32 layers)
4a · query projection
$Q = x\,W_Q \in \mathbb{R}^{S \times H}, \quad W_Q \in \mathbb{R}^{H \times H}$
32 heads × d_h 128 = H 4096
4b · key / value projection · GQA
$K = x\,W_K, \quad V = x\,W_V \in \mathbb{R}^{S \times (n_{kv} \cdot d_h)}$
GQA: n_kv = 8 K/V heads, each shared by a group of 4 query heads → K/V width 8 × 128 = 1024
4c · RoPE on Q, K
$\hat{Q}_i = \mathrm{RoPE}(Q_i, i), \quad \hat{K}_j = \mathrm{RoPE}(K_j, j)$
rotate Q and K by absolute position before the dot product, so each score depends only on the relative offset i − j
4d · scaled dot-product
$\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V$
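computed per head on the rotated Q̂ and K̂ from 4c; d_h = H / n_heads = 4096/32 = 128. A decoder-only model also masks positions j > i before the softmax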
4e · output projection
$\mathrm{out} = \mathrm{Attn}(Q,K,V) \cdot W_O, \quad W_O \in \mathbb{R}^{H \times H}$
4f · kernel
$\textsf{FlashAttention (fused, tiled kernel)}$
same math — IO-aware tiling computes softmax in blocks, so the N×N score matrix is never materialised: O(N) memory instead of O(N²)
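The "same math" claim in 4f rests on the fact that a softmax can be accumulated block by block with a running maximum and a running normaliser. The sketch below shows only that rescaling trick for a single query row in plain Python. It is not the FlashAttention kernel, which is a fused GPU implementation that also tiles the queries, and it omits the causal mask. The names `attend_naive` and `attend_tiled` are illustrative.

```python
import math


def attend_naive(q, keys, values):
    """Reference: materialise every score, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / z for i in range(dim)]


def attend_tiled(q, keys, values, block=2):
    """Same result, but only one block of scores is alive at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 0.25, 2.0]
keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -1.0, 1.5], [-0.7, 0.9, 0.1, 0.0],
        [2.0, -1.0, 0.0, 0.4], [0.0, 0.0, 1.0, 1.0]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.5, 0.5]]

full = attend_naive(q, keys, values)
tiled = attend_tiled(q, keys, values, block=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # True
```

The tiled version keeps a constant amount of state per query (one maximum, one sum, one output vector), which is where the O(N) memory figure comes from.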
5 · Feed-forward · dense SwiGLU (× 32 layers)
$\mathrm{FFN}(x) = \big(\mathrm{SiLU}(x\,W_{\text{gate}}) \odot x\,W_{\text{up}}\big) W_{\text{down}}$
SwiGLU: SiLU/Swish-gated, ⊙ is elementwise
$W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{H \times F}, \quad W_{\text{down}} \in \mathbb{R}^{F \times H}$
3 matrices; H = 4096, F = 14336
end layer loop
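The SwiGLU formula in step 5 is short enough to write out directly. This is a toy sketch in plain Python with H = 2 and F = 3, just to show how the three matrices interact. The weights are made up, and `silu`, `matvec`, and `swiglu_ffn` are illustrative names.

```python
import math


def silu(z):
    """SiLU / Swish: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(x, w_gate)]   # H -> F, then SiLU
    up = matvec(x, w_up)                          # H -> F
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise product
    return matvec(hidden, w_down)                 # F -> H


x = [1.0, -1.0]
w_gate = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]    # x @ w_gate = [1, -1, 2]
w_up = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]      # x @ w_up   = [1, 1, 1]
w_down = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn(x, w_gate, w_up, w_down)

# Weight count at the playground's sizes: three H x F matrices instead of two.
gated_params = 3 * 4096 * 14336
print(out, gated_params)
```

The gate adds a third H × F matrix, so at H = 4096 and F = 14336 the block holds 176,160,768 weights per layer rather than the 117,440,512 of a two-matrix FFN of the same width.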
6 · Output / LM head
$\mathrm{logits} = x_L\,W_{\text{lm}} \in \mathbb{R}^{S \times V}, \quad W_{\text{lm}} \in \mathbb{R}^{H \times V}$
V = 128256 vocab; head 4096 × 128256 = 525.3 M params (often tied to the embedding)
$p(t_{i+1} \mid t_{\le i}) = \mathrm{softmax}(\mathrm{logits}_i)$
next-token distribution over the vocabulary

Spatial architecture

The decoder as a stack of layer slabs — token embeddings flow up through attention (purple) and FFN (teal) blocks, N layers deep.

KV cache at inference

Every generated token writes K and V for all layers. This is what actually fills your VRAM during decoding — and why GQA and low-precision caches matter.

inference seq_len: 8,192 (max context 8,192)

kv cache precision: 16-bit (2 bytes per element, as implied by the 128.0 KB per-token figure)

KV cache total: 1.00 GB (at 8,192 tokens)
per layer: 32.0 MB (32 layers)
per token: 128.0 KB (grows linearly with seq_len)

✓ fits 24 GB VRAM (RTX 4090): KV cache alone, weights not counted
✓ fits 80 GB VRAM (A100/H100): KV cache alone, weights not counted
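The three figures above follow from a few multiplications. The sketch below reproduces them under two assumptions: the cache stores 2 bytes per element (a 16-bit format), and the displayed KB / MB / GB are binary units (2^10, 2^20, 2^30 bytes), which is the reading that matches the displayed numbers exactly. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, d_head, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence."""
    elems_per_token_per_layer = 2 * num_kv_heads * d_head   # one K row and one V row
    per_token = elems_per_token_per_layer * num_layers * bytes_per_elem
    per_layer = elems_per_token_per_layer * bytes_per_elem * seq_len
    return {"per_token": per_token, "per_layer": per_layer,
            "total": per_token * seq_len}


cache = kv_cache_bytes(seq_len=8192, num_layers=32, num_kv_heads=8, d_head=128)
print(cache["per_token"] / 2**10, "KiB per token")   # 128.0
print(cache["per_layer"] / 2**20, "MiB per layer")   # 32.0
print(cache["total"] / 2**30, "GiB total")           # 1.0

# Same model with full multi-head attention (32 K/V heads) for comparison.
mha = kv_cache_bytes(seq_len=8192, num_layers=32, num_kv_heads=32, d_head=128)
print(mha["total"] / 2**30, "GiB total with 32 K/V heads")  # 4.0
```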

Mixture-of-Experts routing

Switch ffn_type to MoE (or pick a Mixtral / DeepSeek preset) to see the router select experts.

This config uses a dense SwiGLU feed-forward block — every token passes through the same up / gate / down weights.

What to look for

Grouped-query attention. Set num_key_value_heads below num_attention_heads and the K and V projection matrices visibly narrow in the per-layer view. The parameter savings are modest; the KV-cache savings are the whole point. LLaMA 3 8B runs 32 query heads over 8 key/value heads — a 4× smaller cache for nearly free.
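A quick way to see the grouping is to map each query head to the K/V head it reads from. The sketch assumes consecutive query heads share a K/V head, a common layout but an assumption here. The name `kv_head_for` is illustrative.

```python
def kv_head_for(query_head, num_heads, num_kv_heads):
    """Index of the K/V head that a given query head attends with."""
    group_size = num_heads // num_kv_heads
    return query_head // group_size


num_heads, num_kv_heads, hidden, d_head, layers = 32, 8, 4096, 128, 32
mapping = [kv_head_for(h, num_heads, num_kv_heads) for h in range(num_heads)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]

# Weights saved in W_K and W_V across all layers versus 32 K/V heads.
saved = layers * 2 * hidden * (num_heads - num_kv_heads) * d_head
cache_ratio = num_heads // num_kv_heads
print(saved, cache_ratio)  # 805306368 4
```

Roughly 0.8 B weights disappear from the K and V projections, about a tenth of the model, while the cache shrinks by the full factor of 4.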

At long context, the KV cache is the inference bottleneck, not the weights. The memory a serving stack fights over during decoding is seq_len × n_kv_heads × d_head × 2 × bytes × n_layers. It grows linearly with context. Push the sequence-length slider toward a long context and watch a 7B model's cache blow past 24 GB even though its weights fit comfortably; then drop the precision to fp8 or int4 and watch it come back.
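Turning the formula around gives the context length at which the cache alone fills a memory budget. The sketch below does that for the 32-layer, d_head = 128 config at three cache precisions. It assumes binary gigabytes and counts raw element storage only: the scales and zero-points that real fp8 or int4 caches carry are ignored. The name `max_tokens` is illustrative.

```python
def max_tokens(budget_gib, num_layers, num_kv_heads, d_head, bits_per_elem):
    """Largest seq_len whose KV cache fits in budget_gib (cache only, no weights)."""
    bytes_per_token = 2 * num_kv_heads * d_head * num_layers * bits_per_elem / 8
    return int(budget_gib * 2**30 // bytes_per_token)


for name, bits in [("fp16", 16), ("fp8", 8), ("int4", 4)]:
    gqa = max_tokens(24, num_layers=32, num_kv_heads=8, d_head=128, bits_per_elem=bits)
    mha = max_tokens(24, num_layers=32, num_kv_heads=32, d_head=128, bits_per_elem=bits)
    print(f"{name}: {gqa:>7} tokens with 8 K/V heads, {mha:>7} with 32")
# fp16:  196608 tokens with 8 K/V heads,   49152 with 32
# fp8:   393216 tokens with 8 K/V heads,   98304 with 32
# int4:  786432 tokens with 8 K/V heads,  196608 with 32
```

Each halving of the precision doubles the context that fits, and the 4× head reduction stacks on top of it.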

Mixture of experts decouples size from cost. Switch the FFN type to MoE, or load Mixtral or DeepSeek-V3. Total parameters explode because every expert is a full feed-forward block, but only num_experts_per_tok of them fire for any given token — so the active parameter count, and the FLOPs, stay small. The routing diagram shows the softmax over experts with the chosen top-k lit up.
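The routing step and the total-versus-active split can both be sketched in a few lines. The example is hypothetical: it takes the dense config above and swaps the FFN for 8 experts with 2 active per token, which is not one of the published presets. Renormalising the chosen weights so they sum to 1 is also an assumption, since models differ on this detail. The names `route` and `moe_ffn_params` are illustrative.

```python
import math


def route(router_logits, top_k):
    """Softmax over experts, keep the top_k, renormalise their weights."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_ffn_params(hidden_size, intermediate_size, num_experts, top_k):
    """Per-layer FFN weights: every expert is a full gated block, plus a router."""
    expert = 3 * hidden_size * intermediate_size
    router = hidden_size * num_experts
    return {"total": num_experts * expert + router,
            "active": top_k * expert + router}


weights = route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2], top_k=2)
print(sorted(weights))  # [1, 4]

ffn = moe_ffn_params(4096, 14336, num_experts=8, top_k=2)
print(ffn["total"], ffn["active"])  # 1409318912 352354304
```

In this hypothetical layer the stored FFN weights grow 8× while the weights touched per token grow only 2×. Attention and embeddings are unaffected, which is why a whole model's total-to-active ratio is smaller than the per-layer FFN ratio.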

This is the first entry in an ongoing experiment: research blogs you can operate, not just read. Everything here is computed from the same arithmetic the model configs imply — no approximations hidden behind the diagram.
