Cheatsheet: LLM Architectures

Swastik Roy

Six LLM architectures (GPT-2, Qwen3-8B, DeepSeek-V3, DeepSeek-R1, GPT-OSS-20B, and GPT-OSS-120B) shown as interactive block diagrams. Click any block to expand equations and parameters. Each model is sourced from its official HF config.json.

January 10, 2025 · 3 min read

cheatsheet architecture transformers gpt2 deepseek qwen moe mla gqa

Every model in this cheatsheet shares the same skeleton, Token Embedding → N × (Norm + Attention + Norm + FFN) → LM Head, but the interesting variation lives in three places: (1) the attention mechanism (MHA → GQA → MLA), (2) the FFN type (dense → MoE with routing), and (3) the positional encoding (absolute → RoPE). The jump from GPT-2 to Qwen3 captures the attention side (MHA → GQA) along with the move from learned absolute positions to RoPE; the jump to DeepSeek changes all three at once. DeepSeek-R1 adds a fourth dimension that the block diagram cannot show: it has the same architecture as V3 but a fundamentally different training pipeline, which uses reinforcement learning on verifiable rewards to elicit long reasoning chains without human-labelled chain-of-thought.

The table below compares the six models on the dimensions that matter most for deployment: how many parameters are active per token (which determines FLOPs), the attention type (which determines KV cache size), and the context window (which determines how much of that cache you actually fill). The block diagrams that follow let you drill into the equations and parameter counts for each model.

Model	Total params	Active params	Layers	Hidden dim	Attention	FFN	Experts	Context	Positional encoding
GPT-2	1.5B	1.5B	48	1600	MHA 25H	Dense GELU	—	1K	Absolute
Qwen3-8B	8B	8B	36	4096	GQA 32Q/8KV	Dense SwiGLU	—	40K	RoPE 1M
DeepSeek-V3	671B	37B	61	7168	MLA 128H	MoE SwiGLU	256+1 top-8	163K	YaRN
DeepSeek-R1	671B	37B	61	7168	MLA 128H	MoE SwiGLU	256+1 top-8	163K	YaRN
GPT-OSS-20B	21B	3.6B	24	2880	GQA 64Q/8KV + sliding	MoE SwiGLU	32 top-4	131K	YaRN
GPT-OSS-120B	117B	5.1B	36	2880	GQA 64Q/8KV + sliding	MoE SwiGLU	128 top-4	131K	YaRN
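To make the "active per token" column concrete, here is a minimal sketch that turns the table's total and active parameter counts into an active fraction and a rough forward-pass cost. The figures are copied from the table; the "about 2 FLOPs per active parameter per token" rule of thumb is an assumption introduced here (it ignores attention-score FLOPs), and the function names are hypothetical.

```python
# (total, active) parameters in billions, copied from the comparison table.
MODELS = {
    "GPT-2": (1.5, 1.5),
    "Qwen3-8B": (8.0, 8.0),
    "DeepSeek-V3": (671.0, 37.0),
    "DeepSeek-R1": (671.0, 37.0),
    "GPT-OSS-20B": (21.0, 3.6),
    "GPT-OSS-120B": (117.0, 5.1),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that participate in one token's forward pass."""
    return active_b / total_b


def approx_gflops_per_token(active_b: float) -> float:
    """Rough forward-pass cost in GFLOPs per token.

    Assumption (not from the configs): about 2 FLOPs per active parameter,
    ignoring the attention-score term that grows with context length.
    """
    return 2.0 * active_b


for name, (total_b, active_b) in MODELS.items():
    frac = active_fraction(total_b, active_b)
    cost = approx_gflops_per_token(active_b)
    print(f"{name:13s} active {frac:6.1%}   ~{cost:.1f} GFLOPs/token")
```

Under these assumptions, the dense models sit at 100% active, while the MoE models activate only a small slice of their weights per token, which is why DeepSeek-V3 costs closer to a 37B dense model per token than to a 671B one.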

Block diagrams

Click any tab to switch models. Use the + button on each block to expand the equation and parameter details for that sub-layer. Hover over a block for the core equation as a tooltip.

LLM Architecture Block Diagrams (interactive figure; the default GPT-2 view is listed here, top to bottom)

Token Embedding: vocab 50257 × d 1600
Positional Encoding: learned absolute, ctx 1024
Pre-Norm (LayerNorm)
MHA (Multi-Head Attention): 25 heads, d_h 64
Post-Attn LayerNorm
Dense FFN (GELU): d_ffn 6400 (4×)
LM Head: d 1600 → vocab 50257

Legend: Embedding / PE / LM Head, Attention, FFN / MoE, Norm
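The GPT-2 block dimensions listed above are enough to reproduce the 1.5B total from the table. The sketch below is a back-of-the-envelope count, not the author's own calculation: it assumes the LM head shares weights with the token embedding, that every linear layer carries a bias, and that a final LayerNorm sits before the LM head (details of the GPT-2 implementation that the diagram does not show). The function name is hypothetical.

```python
def gpt2_param_count(vocab=50257, d=1600, ctx=1024, layers=48, d_ffn=6400):
    """Count GPT-2 (1.5B) parameters from the block-diagram dimensions."""
    embed = vocab * d                      # token embedding (assumed tied with the LM head)
    pos = ctx * d                          # learned absolute positional embedding

    attn = (d * 3 * d + 3 * d)             # fused Q, K, V projection with bias
    attn += (d * d + d)                    # attention output projection with bias
    ffn = (d * d_ffn + d_ffn)              # FFN up-projection with bias
    ffn += (d_ffn * d + d)                 # FFN down-projection with bias
    norms = 2 * (2 * d)                    # two LayerNorms per layer (scale + shift)
    per_layer = attn + ffn + norms

    final_norm = 2 * d                     # assumed final LayerNorm before the LM head
    return embed + pos + layers * per_layer + final_norm


total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Note that 25 heads × d_h 64 = 1600, so the attention projections are all d × d. Almost all of the weight lives in the 48 layers; the embedding contributes only about 80M.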

The key insight from comparing these configs is that MLA (DeepSeek), MoE, and sliding attention each solve a different problem. MLA compresses the KV cache in memory: instead of caching keys and values for 128 heads × 128 dims (32,768 values) per token per layer, it caches a 512-dim latent and decompresses at attention time, cutting KV cache size by roughly 64× (closer to 57× once the 64-dim decoupled RoPE key cached alongside the latent is counted). That compression is a large part of what keeps long-context serving of a 671B model within reach of a manageable number of GPUs. MoE increases total model capacity without increasing active compute: each token routes through only 8 of 256 routed experts (plus one shared expert), so about 37B parameters fire per token while the model has access to 671B worth of learned associations. Sliding attention (GPT-OSS) controls per-layer context reach: odd layers attend only to a 128-token local window (O(n·128)), even layers attend globally (O(n²)), alternating to blend local syntax with global semantics while roughly halving average attention cost at long context lengths.
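The three mechanisms above each come down to a small piece of arithmetic, sketched here under stated assumptions. The baseline counts both keys and values for an uncompressed 128-head cache; the 64-dim RoPE key is taken from my reading of the DeepSeek-V3 config rather than from the text above; the attention cost is measured simply as the number of causal query-key pairs. All function names are hypothetical.

```python
def kv_cache_elems_mha(heads: int, head_dim: int) -> int:
    """Cached values per token per layer when full keys AND values are stored."""
    return 2 * heads * head_dim


def kv_cache_elems_mla(latent_dim: int, rope_dim: int = 0) -> int:
    """Cached values per token per layer for MLA: one latent plus an optional RoPE key."""
    return latent_dim + rope_dim


def routed_expert_fraction(top_k: int, n_routed: int) -> float:
    """Share of routed experts that a single token actually visits."""
    return top_k / n_routed


def attention_pairs(n: int, window: int) -> tuple[int, int]:
    """Causal query-key pairs for (sliding-window layer, global layer) over n tokens."""
    w = min(window, n)
    # The first w tokens see 1..w keys; every later token sees exactly w keys.
    local = w * (w + 1) // 2 + (n - w) * w
    full = n * (n + 1) // 2
    return local, full


baseline = kv_cache_elems_mha(heads=128, head_dim=128)
print("MLA, latent only:   ", baseline / kv_cache_elems_mla(512))
print("MLA, latent + RoPE: ", round(baseline / kv_cache_elems_mla(512, 64), 1))
print("Routed experts used:", routed_expert_fraction(8, 256))

local, full = attention_pairs(n=131072, window=128)
# Averaging one sliding layer with one global layer, relative to two global layers.
print("Alternating cost vs all-global:", round((local + full) / (2 * full), 3))
```

At the 131K context listed for GPT-OSS, the sliding layers contribute almost nothing to the total, so alternating them with global layers costs about half of an all-global stack. At short contexts (n close to 128) the saving shrinks toward zero.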

For the attention equations behind GQA and MLA — and the KV cache arithmetic — see the attention variants cheatsheet. For what that KV cache means for GPU memory at inference time, including how MLA's compression translates to real memory savings on H100, see the inference GPU architecture post and the KV cache lifecycle post.
