Cheatsheet: LLM Architectures
Six LLMs (GPT-2, Qwen3-8B, DeepSeek-V3, DeepSeek-R1, GPT-OSS-20B, and GPT-OSS-120B) shown as interactive block diagrams. Click any block to expand equations and parameters. Each model is sourced from its official Hugging Face config.json.
Every modern LLM shares the same skeleton, Token Embedding → N × (Norm + Attention + Norm + FFN) → LM Head, but the interesting variation lives in three places: (1) the attention mechanism (MHA → GQA → MLA), (2) the FFN type (dense → MoE with routing), and (3) the positional encoding (absolute → RoPE). The jump from GPT-2 to Qwen3 captures the first attention step (MHA → GQA) and the move to RoPE; the jump to DeepSeek adds MLA and MoE routing on top. DeepSeek-R1 adds a fourth dimension that the block diagram cannot show: it has the same architecture as V3 but a fundamentally different training pipeline, which relies on reinforcement learning with verifiable rewards to elicit long reasoning chains rather than on large amounts of human-labelled chain-of-thought.
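The shared skeleton described above can be sketched in a few lines of plain Python. This is a minimal sketch, not any model's actual implementation: the residual additions and the final norm before the LM head are assumptions added here (the one-line skeleton does not spell them out), and `Block`, `forward`, and the identity stand-ins are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]  # sequence of vectors in, same shape out


@dataclass
class Block:
    norm1: Layer
    attn: Layer   # MHA, GQA, or MLA would plug in here
    norm2: Layer
    ffn: Layer    # dense MLP or routed MoE would plug in here


def add(a: List[Vector], b: List[Vector]) -> List[Vector]:
    """Element-wise residual addition over a sequence of vectors."""
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]


def forward(token_ids, embed, blocks, final_norm, lm_head):
    x = embed(token_ids)                      # Token Embedding
    for blk in blocks:                        # N x (Norm + Attention + Norm + FFN)
        x = add(x, blk.attn(blk.norm1(x)))    # assumed pre-norm residual
        x = add(x, blk.ffn(blk.norm2(x)))
    return lm_head(final_norm(x))             # LM Head (after an assumed final norm)


# Toy usage with identity stand-ins, only to show the data flow.
identity = lambda x: x
toy_embed = lambda ids: [[float(i)] for i in ids]
toy_blocks = [Block(identity, identity, identity, identity)] * 2
toy_out = forward([1, 2], toy_embed, toy_blocks, identity, identity)
```

With identity sub-layers each residual step doubles the activations, which makes the wiring easy to check by hand. Swapping the `attn` and `ffn` callables is where the six models differ.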
The table below compares the six models on the dimensions that matter most for deployment: how many parameters are active per token (which determines FLOPs), the attention type (which determines KV cache size), and the context window (which determines how much of that cache you actually fill). The block diagrams that follow let you drill into the equations and parameter counts for each model.
| Model | Total params | Active params | Layers | Hidden size | Attention | FFN | Experts | Context (tokens) | Positional encoding |
|---|---|---|---|---|---|---|---|---|---|
| GPT-2 | 1.5B | 1.5B | 48 | 1600 | MHA 25H | Dense GELU | — | 1K | Absolute |
| Qwen3-8B | 8B | 8B | 36 | 4096 | GQA 32Q/8KV | Dense SwiGLU | — | 40K | RoPE 1M |
| DeepSeek-V3 | 671B | 37B | 61 | 7168 | MLA 128H | MoE SwiGLU | 256+1 top-8 | 163K | YaRN |
| DeepSeek-R1 | 671B | 37B | 61 | 7168 | MLA 128H | MoE SwiGLU | 256+1 top-8 | 163K | YaRN |
| GPT-OSS-20B | 21B | 3.6B | 24 | 2880 | GQA 64Q/8KV + sliding | MoE SwiGLU | 32 top-4 | 131K | YaRN |
| GPT-OSS-120B | 117B | 5.1B | 36 | 2880 | GQA 64Q/8KV + sliding | MoE SwiGLU | 128 top-4 | 131K | YaRN |
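To make the link between attention type and KV cache size concrete, here is a rough back-of-the-envelope sketch. Layer counts and KV head counts come from the table; the head dimensions (64 for GPT-2, 128 for Qwen3-8B), the 64-dim decoupled RoPE key for MLA, the exact context lengths, and the 2-byte (bf16/fp16) cache values are assumptions not stated in the table. The function names are hypothetical.

```python
def kv_values_per_token(layers: int, kv_heads: int, head_dim: int) -> int:
    """MHA/GQA: one key and one value vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim


def mla_values_per_token(layers: int, latent_dim: int, rope_dim: int) -> int:
    """MLA: one shared latent plus one decoupled RoPE key per layer."""
    return layers * (latent_dim + rope_dim)


def cache_gib(values_per_token: int, context_tokens: int, bytes_per_value: int = 2) -> float:
    """Cache size in GiB for a single sequence at the given context length."""
    return values_per_token * context_tokens * bytes_per_value / 2**30


# head_dim values are assumptions (hidden size / query heads).
gpt2 = kv_values_per_token(layers=48, kv_heads=25, head_dim=64)
qwen3_8b = kv_values_per_token(layers=36, kv_heads=8, head_dim=128)
# latent_dim is the 512-dim latent; rope_dim=64 is an assumption.
deepseek_v3 = mla_values_per_token(layers=61, latent_dim=512, rope_dim=64)

for name, per_token, context in [
    ("GPT-2", gpt2, 1024),
    ("Qwen3-8B", qwen3_8b, 40_960),
    ("DeepSeek-V3", deepseek_v3, 163_840),
]:
    size = cache_gib(per_token, context)
    print(f"{name}: {per_token} values/token, {size:.2f} GiB at full context")
```

Under these assumptions GPT-2 caches 153,600 values per token, Qwen3-8B 73,728, and DeepSeek-V3 35,136, so the 61-layer MLA model stores less per token than the 36-layer GQA model.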
Block diagrams
Click any tab to switch models. Use the + button on each block to expand the equation and parameter details for that sub-layer. Hover over a block for the core equation as a tooltip.
The key insight from comparing these configs is that MLA (DeepSeek), MoE, and sliding attention each solve a different problem. MLA compresses the KV cache in memory: instead of caching keys and values for 128 heads × 128 dims per token per layer (2 × 128 × 128 = 32,768 values), it caches a 512-dim latent and decompresses at attention time, cutting KV cache size by about 64× (closer to 57× once the 64-dim decoupled RoPE key that MLA also caches is counted). That is a large part of what lets DeepSeek-V3 serve long contexts with a 671B model on a manageable number of GPUs. MoE increases total model capacity without increasing active compute: each token routes through only 8 of 256 routed experts (plus one shared expert that every token uses), so about 37B parameters fire per token while the model has access to 671B parameters' worth of learned associations. Sliding attention (GPT-OSS) controls per-layer context reach: odd layers attend only to a 128-token local window (O(n·128)), even layers attend globally (O(n²)), alternating to blend local syntax with global semantics while roughly halving average attention cost at long context.
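Two of the mechanisms above are easy to sketch in plain Python. This is a generic illustration under stated assumptions, not either model's actual code: `route_top_k` uses a softmax over the selected logits, which is a common textbook choice and not necessarily the exact gating used by DeepSeek-V3 or GPT-OSS, and `visible_keys` assumes a causal window that includes the current token. All function names are hypothetical.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalise their weights to sum to 1."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def visible_keys(query_pos, layer_is_local, window=128):
    """Key positions a causal query at query_pos may attend to."""
    start = max(0, query_pos - window + 1) if layer_is_local else 0
    return range(start, query_pos + 1)


def attention_pairs(n_tokens, layer_is_local, window=128):
    """Total query-key pairs scored in one layer over a sequence of n_tokens."""
    return sum(len(visible_keys(q, layer_is_local, window)) for q in range(n_tokens))


# Routing: only the chosen experts run, weighted by their normalised scores.
chosen = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)

# Sliding vs. global cost at 4,096 tokens.
global_pairs = attention_pairs(4096, layer_is_local=False)
local_pairs = attention_pairs(4096, layer_is_local=True)
average_vs_global = (global_pairs + local_pairs) / 2 / global_pairs
print(chosen, global_pairs, local_pairs, round(average_vs_global, 3))
```

At 4,096 tokens the local layer scores 516,160 pairs against 8,390,656 for the global layer, so alternating the two costs a little over half of an all-global stack. The ratio gets closer to one half as the sequence grows.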
For the attention equations behind GQA and MLA — and the KV cache arithmetic — see the attention variants cheatsheet. For what that KV cache means for GPU memory at inference time, including how MLA's compression translates to real memory savings on H100, see the inference GPU architecture post and the KV cache lifecycle post.