S. Roy

The Architecture Playground: What a Transformer Config Actually Buys You

An interactive research blog. Drag the config of a decoder-only transformer — hidden size, head counts, FFN type — and watch the parameter count, KV cache, and mixture-of-experts routing recompute live.

A frontier language model, stripped of its press release, is a short list of integers. Hidden size, number of layers, how many attention heads, how many of those heads share key/value projections, the feed-forward width, the vocabulary, the context length. Those numbers decide the parameter count, the memory each token of context costs at inference, and whether the model is dense or a mixture of experts.

The trouble with reading them off a table is that the consequences are non-linear and they interact. Halving the key/value heads barely touches the parameter count but halves the KV cache. Switching a dense feed-forward block for a mixture of experts can multiply the total parameters several times over while leaving the compute per token almost unchanged. The only way to build intuition is to turn the dials and watch.

So turn them. The playground below is live — every slider recomputes the parameter breakdown, the spatial diagram, the KV-cache memory bar, and the expert-routing view in real time. Load a preset to start from a real published config.

config.json

attn_type: Grouped-Query (GQA)
ffn_type (structural): dense
hidden_act (activation): SwiGLU (gated → adds a gate matrix, 3 total)
pos_encoding: RoPE
norm_type: RMSNorm, pre-norm (norm inside each residual branch, modern)

Parameter summary

Total params: 7.50 B (32 layers)
Active params: 7.50 B (dense, all active)
d_head: 128
Attention: GQA (8/32 kv)
KV / token: 2,048 values per layer (2·n_kv·d_h)
FFN: SwiGLU (3 matrices)

⚡ kernel: Flash Attention — same math, IO-aware tiling, O(N) memory vs O(N²)
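The summary numbers can be reproduced by hand. The sketch below is one plausible accounting, not the playground's actual code: it assumes bias-free projections, two RMSNorm gain vectors per layer plus a final norm, and an LM head tied to the embedding (the `tie_embeddings` flag and the `Config` / `count_params` names are illustrative). Under those assumptions it lands on the 7.50 B shown above.

```python
from dataclasses import dataclass


@dataclass
class Config:
    hidden_size: int = 4096
    num_layers: int = 32
    num_heads: int = 32
    num_kv_heads: int = 8
    ffn_size: int = 14336
    vocab_size: int = 128256
    # Assumption: one matrix serves as both embedding and LM head.
    tie_embeddings: bool = True


def count_params(c: Config) -> dict:
    d_head = c.hidden_size // c.num_heads
    kv_width = c.num_kv_heads * d_head
    # W_Q and W_O are H x H; W_K and W_V are H x (n_kv * d_head).
    attn = 2 * c.hidden_size * c.hidden_size + 2 * c.hidden_size * kv_width
    # SwiGLU: gate, up, and down matrices.
    ffn = 3 * c.hidden_size * c.ffn_size
    # Two RMSNorm gain vectors per layer (no bias).
    norms = 2 * c.hidden_size
    embed = c.vocab_size * c.hidden_size
    lm_head = 0 if c.tie_embeddings else embed
    # The trailing hidden_size term is the final RMSNorm before the LM head.
    total = c.num_layers * (attn + ffn + norms) + embed + lm_head + c.hidden_size
    return {"attn_per_layer": attn, "ffn_per_layer": ffn, "embed": embed, "total": total}


params = count_params(Config())
print(f"attention / layer: {params['attn_per_layer'] / 1e6:.1f} M")
print(f"FFN / layer:       {params['ffn_per_layer'] / 1e6:.1f} M")
print(f"total:             {params['total'] / 1e9:.2f} B")
```

Flipping `tie_embeddings` to `False` adds a second 525.3 M matrix and brings the total to about 8.03 B, which is why the same config is often quoted as an "8B" model.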

The forward pass — a live derivation

The whole computation, written out and specialised to the exact config above. This is the heart of the playground: change a field and the math rewrites itself. Every shape, branch, and substituted value updates as you drag a slider or load a preset.

  1. Input embedding

    $$x_0 = \mathrm{Embed}(t) \in \mathbb{R}^{S \times H}$$
    S = 8,192 tokens, H = 4096 → x₀ has shape 8,192 × 4096
  2. Positional encoding · RoPE

    $$\mathrm{RoPE}(x, pos) = x \cdot e^{\,i\, pos\, \theta}, \quad \theta_j = 10000^{-2j/d_h}$$
    applied to Q and K inside attention, not to x₀. d_h = 128
  repeated for each layer ℓ = 1 … 32
  3. Residual block · pre-norm, RMSNorm

    × 32 layers
    3a · pre-norm residual
    $$\begin{aligned} h &= x + \mathrm{Attn}\big(\mathrm{RMSNorm}(x)\big) \\ x' &= h + \mathrm{FFN}\big(\mathrm{RMSNorm}(h)\big) \end{aligned}$$
    normalize inside each residual branch: stable gradients, the modern default
    3b · RMSNorm definition
    $$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{H}\sum_i x_i^2 + \epsilon}} \cdot \gamma$$
    drops mean-centering and bias; cheaper, ~same quality; H = 4096
  4. Attention · Grouped-Query (GQA) · Flash

    × 32 layers
    4a · query projection
    $$Q = x\,W_Q \in \mathbb{R}^{S \times H}, \quad W_Q \in \mathbb{R}^{H \times H}$$
    32 heads × d_h 128 = H 4096
    4b · key / value projection · GQA
    $$K = x\,W_K, \quad V = x\,W_V \in \mathbb{R}^{S \times (n_{kv} \cdot d_h)}$$
    GQA: n_kv = 8 K/V heads, each shared by a group of 4 query heads → K/V width 8 × 128 = 1024
    4c · RoPE on Q, K
    $$\hat{Q}_i = \mathrm{RoPE}(Q_i, i), \quad \hat{K}_j = \mathrm{RoPE}(K_j, j)$$
    rotate Q and K by absolute position before the dot product; the resulting score depends only on the offset i − j
    4d · scaled dot-product
    $$\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V$$
    d_h = H / n_heads = 4096/32 = 128
    4e · output projection
    $$\mathrm{out} = \mathrm{Attn}(Q,K,V) \cdot W_O, \quad W_O \in \mathbb{R}^{H \times H}$$
    4f · kernel
    FlashAttention (fused, tiled kernel)
    same math: IO-aware tiling computes softmax in blocks, so the N×N score matrix (N = sequence length S) is never materialised, giving O(N) memory instead of O(N²)
  5. Feed-forward · dense SwiGLU

    × 32 layers
    $$\mathrm{FFN}(x) = \big(\mathrm{SiLU}(x\,W_{\text{gate}}) \odot x\,W_{\text{up}}\big) W_{\text{down}}$$
    SwiGLU: SiLU/Swish-gated, ⊙ is elementwise
    $$W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{H \times F}, \quad W_{\text{down}} \in \mathbb{R}^{F \times H}$$
    3 matrices; H = 4096, F = 14336
    end layer loop
  6. Output / LM head

    $$\mathrm{logits} = x_L\,W_{\text{lm}} \in \mathbb{R}^{S \times V}, \quad W_{\text{lm}} \in \mathbb{R}^{H \times V}$$
    V = 128256 vocab; head 4096 × 128256 = 525.3 M params (often tied to the embedding)
    $$p(t_{i+1} \mid t_{\le i}) = \mathrm{softmax}(\mathrm{logits}_i)$$
    next-token distribution over the vocabulary
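One step in the derivation is easier to trust once you run it: RoPE rotates Q and K by absolute position, yet the attention score ends up depending only on how far apart the two positions are. The sketch below checks that on a toy head with d_h = 8. It assumes the interleaved pairing, where elements (2j, 2j+1) form one complex number; some implementations pair the first half of the vector with the second half instead, which is an equivalent convention. The function names and sample vectors are illustrative.

```python
import cmath
import math


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2j], vec[2j+1]) by the angle pos * theta_j."""
    d_h = len(vec)
    out = []
    for j in range(d_h // 2):
        theta_j = base ** (-2.0 * j / d_h)
        z = complex(vec[2 * j], vec[2 * j + 1]) * cmath.exp(1j * pos * theta_j)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors for a single head (d_h = 8).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [0.8, 0.1, -0.5, 1.3, 0.2, -0.7, 0.4, 1.0]

# Same offset (3), very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

Both scores use an offset of 3, so they agree up to floating-point error even though the absolute positions differ by 100.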

Spatial architecture

The decoder as a stack of layer slabs — token embeddings flow up through attention (purple) and FFN (teal) blocks, N layers deep.

[Figure: layer stack. Token embeddings at the bottom, × 32 layers (8 shown), each layer an attention block and a SwiGLU FFN block; d_model = 4,096.]

Inside one layer, weight matrices (sized ∝ dimensions):
Self-attention: Q 4,096×4,096 · K 4,096×1,024 · V 4,096×1,024 · O 4,096×4,096
GQA: 32 query heads share 8 K/V heads → K/V projections 4× narrower
FFN (dense, SwiGLU): up 4,096×14,336 · gate 4,096×14,336 · down 14,336×4,096

[Figure: attention mask, full causal. Rows = query i, cols = key j, bright = attended. Each token attends to all earlier tokens (lower triangle); future is masked.]

KV cache at inference

Every generated token writes K and V for all layers. This is what actually fills your VRAM during decoding — and why GQA and low-precision caches matter.

kv cache precision: bf16
KV cache total: 1.00 GB at 8,192 tokens
per layer: 32.0 MB (32 layers)
per token: 128.0 KB (grows linearly with seq_len)
memory = n_layers × 2 × n_kv × d_head × seq_len × bytes
32 of 32 layer slabs · 8 KV heads × 128 d_head × 2 (K+V) → 1.00 GB @ bf16
✓ fits 24 GB VRAM (RTX 4090): KV cache alone, weights not counted
✓ fits 80 GB VRAM (A100/H100): KV cache alone, weights not counted
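The three figures in this panel all come out of the one memory formula. Here is a minimal sketch that reproduces them, assuming bf16 means 2 bytes per value and that the panel's KB / MB / GB are powers of 1024 (the numbers only match under that reading). The function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value):
    # The factor of 2 covers one K and one V vector per KV head.
    return n_layers * 2 * n_kv_heads * d_head * seq_len * bytes_per_value


BF16 = 2  # bytes per value

per_token = kv_cache_bytes(32, 8, 128, seq_len=1, bytes_per_value=BF16)
per_layer = kv_cache_bytes(1, 8, 128, seq_len=8192, bytes_per_value=BF16)
total = kv_cache_bytes(32, 8, 128, seq_len=8192, bytes_per_value=BF16)

print(f"per token: {per_token / 1024:.1f} KB")
print(f"per layer: {per_layer / 1024**2:.1f} MB")
print(f"total:     {total / 1024**3:.2f} GB")
```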

Mixture-of-Experts routing

Switch ffn_type to MoE (or pick a Mixtral / DeepSeek preset) to see the router select experts.

This config uses a dense SwiGLU feed-forward block — every token passes through the same up / gate / down weights.
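For contrast with the dense case, here is roughly what a router does when the FFN is a mixture of experts: score every expert, keep the top k, and mix their outputs by weight. This is a sketch under assumptions, not the playground's code. In particular, renormalising the kept probabilities so they sum to 1 is one common choice rather than a universal rule, and the logits are made-up numbers.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k):
    """Return {expert_index: mixing_weight} for the top_k experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    # Assumption: renormalise over the chosen experts.
    norm = sum(probs[e] for e in chosen)
    return {e: probs[e] / norm for e in chosen}


# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
weights = route(logits, top_k=2)
print(weights)
```

Only the experts in `weights` run their feed-forward block for this token; the other six are skipped entirely, which is where the compute saving comes from.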

What to look for

Grouped-query attention. Set num_key_value_heads below num_attention_heads and the K and V projection matrices visibly narrow in the per-layer view. The parameter savings are modest; the KV-cache savings are the whole point. LLaMA 3 8B runs 32 query heads over 8 key/value heads — a 4× smaller cache for nearly free.
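To put numbers on "modest" versus "the whole point", the sketch below compares full multi-head attention (32 K/V heads) with the GQA setting (8 K/V heads) at H = 4096 over 32 layers. It assumes bias-free projections, and the helper names are illustrative.

```python
def attn_params_per_layer(hidden, n_heads, n_kv_heads):
    d_head = hidden // n_heads
    kv_width = n_kv_heads * d_head
    # W_Q and W_O are hidden x hidden; W_K and W_V are hidden x kv_width.
    return 2 * hidden * hidden + 2 * hidden * kv_width


def kv_values_per_token_per_layer(hidden, n_heads, n_kv_heads):
    return 2 * n_kv_heads * (hidden // n_heads)


N_LAYERS = 32
mha = attn_params_per_layer(4096, 32, 32)
gqa = attn_params_per_layer(4096, 32, 8)
params_saved = N_LAYERS * (mha - gqa)

cache_ratio = (kv_values_per_token_per_layer(4096, 32, 32)
               / kv_values_per_token_per_layer(4096, 32, 8))

print(f"parameters saved: {params_saved / 1e9:.2f} B")
print(f"KV cache shrinks by: {cache_ratio:.0f}x")
```

The parameter saving is about 0.8 B on a model of roughly 7.5 B, while the cache shrinks by the full ratio of query heads to K/V heads.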

The KV cache is the inference bottleneck, not the weights. The memory a serving stack fights over during decoding is seq_len × n_kv_heads × d_head × 2 × bytes × n_layers. It grows linearly with context. Push the sequence-length slider toward a long context and watch a 7B model's cache blow past 24 GB even though its weights fit comfortably (at bf16 the config above costs 128 KB per token, so the crossing comes near 196,608 tokens); then drop the precision to fp8 or int4 and watch it come back.
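The same formula can be turned around to ask how many tokens of context fit in a given memory budget. The sketch below does that for the 32-layer, 8-KV-head, d_head 128 config. It counts the cache alone (weights and activations are ignored), treats int4 as exactly half a byte per value with no overhead for quantisation scales, and reads "24 GB" as 24 × 1024³ bytes. All of those are simplifying assumptions, and `max_tokens` is an illustrative name.

```python
def max_tokens(vram_gib, bytes_per_value, n_layers=32, n_kv_heads=8, d_head=128):
    """Largest seq_len whose KV cache alone fits in vram_gib."""
    per_token = n_layers * 2 * n_kv_heads * d_head * bytes_per_value
    return int(vram_gib * 1024**3 // per_token)


for name, nbytes in [("bf16", 2), ("fp8", 1), ("int4", 0.5)]:
    print(f"{name}: {max_tokens(24, nbytes):,} tokens in 24 GB")
```

Each halving of the cache precision doubles the context that fits, which is the effect the precision control makes visible.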

Mixture of experts decouples size from cost. Switch the FFN type to MoE, or load Mixtral or DeepSeek-V3. Total parameters explode because every expert is a full feed-forward block, but only num_experts_per_tok of them fire for any given token — so the active parameter count, and the FLOPs, stay small. The routing diagram shows the softmax over experts with the chosen top-k lit up.
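The gap between total and active parameters is easy to compute. The sketch below uses values I believe match the published Mixtral 8x7B config (hidden 4096, 32 layers, 32 query heads, 8 K/V heads, FFN width 14336, vocabulary 32000, 8 experts, 2 per token, untied embeddings); treat them as assumptions rather than something this page states. It also assumes bias-free layers and a single linear router per layer.

```python
def moe_params(hidden, n_layers, n_heads, n_kv_heads, ffn_size, vocab,
               n_experts, experts_per_tok, tie_embeddings=False):
    d_head = hidden // n_heads
    attn = 2 * hidden * hidden + 2 * hidden * n_kv_heads * d_head
    expert = 3 * hidden * ffn_size   # one full SwiGLU block per expert
    router = hidden * n_experts      # linear layer producing expert logits
    norms = 2 * hidden               # two RMSNorm gain vectors per layer
    shared = attn + router + norms   # runs for every token
    embed = vocab * hidden * (1 if tie_embeddings else 2)
    final_norm = hidden
    total = n_layers * (shared + n_experts * expert) + embed + final_norm
    active = n_layers * (shared + experts_per_tok * expert) + embed + final_norm
    return total, active


total, active = moe_params(4096, 32, 32, 8, 14336, 32000,
                           n_experts=8, experts_per_tok=2)
print(f"total:  {total / 1e9:.1f} B")
print(f"active: {active / 1e9:.1f} B")
```

Under these assumptions the model stores several times more parameters than it uses for any single token, because attention, the router, and the embeddings are shared while only 2 of the 8 expert blocks run.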

This is the first entry in an ongoing experiment: research blogs you can operate, not just read. Everything here is computed from the same arithmetic the model configs imply — no approximations hidden behind the diagram.

Cite this work

Roy, Swastik. (2024). The Architecture Playground: What a Transformer Config Actually Buys You. S. Roy. https://swastikroy.me/blog/architecture-playground-transformer
