S. Roy

Blog Post

Cheatsheet: LLM Architectures

Six LLM configurations from five model families (GPT-2, Qwen3-8B, DeepSeek-V3, DeepSeek-R1, and GPT-OSS in its 20B and 120B sizes), shown as interactive block diagrams. Click any block to expand equations and parameters. Each model is sourced from its official HF config.json.


Every modern LLM shares the same skeleton, Token Embedding → N × (Norm + Attention + Norm + FFN) → LM Head, but the interesting variation lives in three places: (1) the attention mechanism (MHA → GQA → MLA), (2) the FFN type (dense → MoE with routing), and (3) the positional encoding (absolute → RoPE). The jump from GPT-2 to Qwen3 captures the attention and positional-encoding changes while the FFN stays dense; the jump to DeepSeek captures all three at once. DeepSeek-R1 adds a fourth dimension that the block diagram cannot show: it has the same architecture as V3 but a fundamentally different training pipeline, using reinforcement learning on verifiable rewards to elicit long reasoning chains without human-labelled chain-of-thought.
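The shared skeleton can be made concrete with a tiny NumPy forward pass. This is a minimal sketch, not any model's actual implementation: the function names, the single-head attention, the ReLU FFN, the parameter-free LayerNorm, and the final norm before the LM head are all simplifying assumptions. The attention and FFN functions mark the two slots where real models swap in GQA or MLA and SwiGLU or MoE.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm (assumption: learned scale/shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_attention(x, wq, wk, wv, wo):
    # Single-head causal self-attention; stands in for MHA / GQA / MLA.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # positions j > i
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ wo

def ffn(x, w1, w2):
    # Dense two-layer FFN; ReLU stands in for GELU / SwiGLU / an MoE layer.
    return np.maximum(x @ w1, 0.0) @ w2

def forward(tokens, params):
    # Token Embedding -> N x (Norm + Attention + Norm + FFN) -> LM Head
    x = params["embed"][tokens]
    for layer in params["layers"]:
        x = x + causal_attention(layer_norm(x), *layer["attn"])
        x = x + ffn(layer_norm(x), *layer["ffn"])
    return layer_norm(x) @ params["lm_head"]

def init_params(vocab, d, n_layers, seed=0):
    # Hypothetical toy initialiser: small Gaussian weights.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.02, size=shape)
    layers = [
        {"attn": [w(d, d) for _ in range(4)], "ffn": [w(d, 4 * d), w(4 * d, d)]}
        for _ in range(n_layers)
    ]
    return {"embed": w(vocab, d), "layers": layers, "lm_head": w(d, vocab)}

params = init_params(vocab=11, d=8, n_layers=2)
logits = forward(np.array([1, 4, 2, 7, 3]), params)
print(logits.shape)  # one row of vocabulary logits per input position
```

Positional encoding is left out on purpose: in GPT-2 it would be added to the embedding, while RoPE-based models apply it inside attention.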

The table below compares the six models on the dimensions that matter most for deployment: how many parameters are active per token (which determines FLOPs), the attention type (which determines KV cache size), and the context window (which determines how much of that cache you actually fill). The block diagrams that follow let you drill into the equations and parameter counts for each model.

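| Model | Total | Active | Layers | Hidden | Attention | FFN | Experts | Context | PE |
|---|---|---|---|---|---|---|---|---|---|
| GPT-2 | 1.5B | 1.5B | 48 | 1600 | MHA 25H | Dense GELU | - | 1K | Absolute |
| Qwen3-8B | 8B | 8B | 36 | 4096 | GQA 32Q/8KV | Dense SwiGLU | - | 40K | RoPE 1M |
| DeepSeek-V3 | 671B | 37B | 61 | 7168 | MLA 128H | MoE SwiGLU | 256+1 top-8 | 163K | YaRN |
| DeepSeek-R1 | 671B | 37B | 61 | 7168 | MLA 128H | MoE SwiGLU | 256+1 top-8 | 163K | YaRN |
| GPT-OSS-20B | 21B | 3.6B | 24 | 2880 | GQA 64Q/8KV + sliding | MoE SwiGLU | 32 top-4 | 131K | YaRN |
| GPT-OSS-120B | 117B | 5.1B | 36 | 2880 | GQA 64Q/8KV + sliding | MoE SwiGLU | 128 top-4 | 131K | YaRN |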
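To see how the Total and Active columns translate into compute, the sketch below computes the active fraction for each model and a rough forward-pass cost. The rule of thumb of about 2 FLOPs per active parameter per token is an assumption not stated in the article; it ignores attention-score FLOPs, which grow with context length. The function names are illustrative.

```python
# (total parameters, active parameters per token), taken from the table.
MODELS = {
    "GPT-2": (1.5e9, 1.5e9),
    "Qwen3-8B": (8e9, 8e9),
    "DeepSeek-V3": (671e9, 37e9),
    "DeepSeek-R1": (671e9, 37e9),
    "GPT-OSS-20B": (21e9, 3.6e9),
    "GPT-OSS-120B": (117e9, 5.1e9),
}

def active_fraction(total, active):
    # Share of the weights that participate in one token's forward pass.
    return active / total

def forward_flops_per_token(active):
    # Assumed rule of thumb: one multiply and one add per active weight.
    return 2.0 * active

for name, (total, active) in MODELS.items():
    frac = active_fraction(total, active)
    gflops = forward_flops_per_token(active) / 1e9
    print(f"{name:13s} active {frac:6.1%}  ~{gflops:6.1f} GFLOPs/token")
```

Under this approximation GPT-OSS-120B costs less per token than Qwen3-8B despite holding roughly 15 times as many weights, which is the MoE trade: memory footprint follows the Total column, compute follows the Active column.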

Block diagrams

Click any tab to switch models. Use the + button on each block to expand the equation and parameter details for that sub-layer. Hover over a block for the core equation as a tooltip.

GPT-2 block diagram (default tab), top to bottom:

Token Embedding: vocab 50257 × d 1600
Positional Encoding: learned absolute, ctx 1024
Pre-Norm (LayerNorm)
MHA (Multi-Head Attention): 25 heads, d_h 64
Post-Attn LayerNorm
Dense FFN (GELU): d_ffn 6400 (4×)
LM Head: d 1600 → vocab 50257

Legend: Embedding / PE / LM Head, Attention, FFN / MoE, Norm
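The GPT-2 block sizes listed above can be checked against the 1.5B total in the table. The sketch below adds them up using the 48 layers from the table. Three details are assumptions not shown in the diagram: every linear layer has a bias, the LM head shares its weights with the token embedding, and a final LayerNorm sits before the LM head. The function name is illustrative.

```python
def gpt2_param_count(vocab=50257, d=1600, ctx=1024, d_ffn=6400, n_layers=48):
    # Embedding tables.
    token_embedding = vocab * d
    positional_encoding = ctx * d

    # Attention: fused QKV projection plus output projection, each with bias.
    attention = (d * 3 * d + 3 * d) + (d * d + d)

    # Dense FFN: up-projection and down-projection, each with bias.
    ffn = (d * d_ffn + d_ffn) + (d_ffn * d + d)

    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * (2 * d)

    per_layer = attention + ffn + norms
    final_norm = 2 * d

    # Assumption: LM head is tied to the token embedding, so it adds nothing.
    total = token_embedding + positional_encoding + n_layers * per_layer + final_norm
    return {"per_layer": per_layer, "total": total}

counts = gpt2_param_count()
print(f"per layer: {counts['per_layer']:,}")
print(f"total:     {counts['total']:,}")
```

Under these assumptions the total is 1,557,611,200, consistent with the 1.5B figure in the table; the 48 transformer blocks account for about 95% of it, and the embeddings for most of the rest.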

The key insight from comparing these configs is that MLA (DeepSeek), MoE, and sliding attention each solve a different problem. MLA compresses the KV cache in memory: instead of caching keys and values for 128 heads × 128 dims per token per layer (2 × 128 × 128 = 32,768 values), it caches a 512-dim latent and decompresses at attention time, cutting KV cache size by about 64× (closer to 57× once the 64-dim decoupled RoPE key stored alongside the latent is counted). That is the primary innovation that lets DeepSeek-V3 run a 671B model on a manageable number of GPUs. MoE increases total model capacity without increasing active compute: each token routes through only 8 of 256 routed experts plus one shared expert, so 37B parameters fire per token while the model has access to 671B worth of learned associations. Sliding attention (GPT-OSS) controls per-layer context reach: odd layers attend only to a 128-token local window (O(n·128)), even layers attend globally (O(n²)), alternating to blend local syntax with global semantics while roughly halving average attention cost at long context.
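The KV cache and sliding-window claims in the paragraph above reduce to simple arithmetic, sketched here. The 128 heads, 128 dims, 512-dim latent, 61 layers, and 128-token window come from the article. The 64-dim RoPE key (`qk_rope_head_dim` in the DeepSeek-V3 config), the 2-byte cache entries, and the 131,072-token example context are assumptions added for illustration, as are all function names.

```python
def kv_values_mha(n_heads, head_dim):
    # Standard attention caches one key and one value vector per head.
    return 2 * n_heads * head_dim

def kv_values_mla(latent_dim, rope_dim=0):
    # MLA caches a shared latent, optionally plus a decoupled RoPE key.
    return latent_dim + rope_dim

def cache_gib(values_per_token_layer, n_layers, n_tokens, bytes_per_value=2):
    # Cache size for one sequence, in GiB (assumes 2-byte entries, e.g. bf16).
    return values_per_token_layer * n_layers * n_tokens * bytes_per_value / 2**30

def causal_pairs_full(n):
    # Query i (0-based) attends to i + 1 keys under a causal mask.
    return n * (n + 1) // 2

def causal_pairs_sliding(n, window):
    # Query i attends to min(i + 1, window) keys.
    if n <= window:
        return causal_pairs_full(n)
    return causal_pairs_full(window) + (n - window) * window

def alternating_cost_ratio(n, window):
    # Attention-score cost of alternating sliding/full layers vs all-full layers.
    full = causal_pairs_full(n)
    return (full + causal_pairs_sliding(n, window)) / (2 * full)

mha = kv_values_mha(n_heads=128, head_dim=128)
latent_only = kv_values_mla(latent_dim=512)
with_rope = kv_values_mla(latent_dim=512, rope_dim=64)

print("compression, latent only:", mha / latent_only)
print("compression, with RoPE key:", round(mha / with_rope, 1))
print("MHA-style cache GiB:", cache_gib(mha, n_layers=61, n_tokens=131072))
print("MLA cache GiB:", cache_gib(latent_only, n_layers=61, n_tokens=131072))
print("alternating cost ratio:", alternating_cost_ratio(131072, 128))
```

The cost ratio approaches 0.5 only when the context is much longer than the window; for sequences of 128 tokens or fewer the sliding layers cost the same as the global ones, so the saving is a long-context effect.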

For the attention equations behind GQA and MLA — and the KV cache arithmetic — see the attention variants cheatsheet. For what that KV cache means for GPU memory at inference time, including how MLA's compression translates to real memory savings on H100, see the inference GPU architecture post and the KV cache lifecycle post.

Cite this work

Roy, Swastik. (2025). Cheatsheet: LLM Architectures. S. Roy. https://swastikroy.me/blog/cheatsheet-architectures
