S. Roy

Blog Post

Cheatsheet: LLM Forward Pass Equations

The full forward pass, written out as equations, for GPT-2, Qwen3-8B, DeepSeek-V3, and GPT-OSS. Every matrix, every norm, every residual — in the order the computation actually happens.


The forward pass of a language model is a fixed sequence of matrix multiplications, normalizations, and nonlinearities applied to a token sequence. The high-level structure is identical across all modern decoder-only models (embedding → $L$ transformer blocks → final norm → LM head), but the internals of each block differ in exactly the ways that matter: which attention mechanism, which FFN, how many experts, how the position encoding is applied.
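The shared skeleton can be sketched in a few lines of Python. This only illustrates the control flow: `Block`, `forward`, and `add` are hypothetical names, the sublayers are passed in as plain callables, and nothing here is taken from any model's actual code.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def add(a: List[Vector], b: List[Vector]) -> List[Vector]:
    """Elementwise sum of two sequences of vectors (the residual add)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class Block:
    """One pre-norm block: norm -> sublayer -> residual, twice."""

    def __init__(self, norm1: Layer, attn: Layer, norm2: Layer, ffn: Layer):
        self.norm1, self.attn, self.norm2, self.ffn = norm1, attn, norm2, ffn

    def __call__(self, x: List[Vector]) -> List[Vector]:
        z = add(x, self.attn(self.norm1(x)))    # first residual
        return add(z, self.ffn(self.norm2(z)))  # second residual


def forward(tokens: Sequence[int], embed, blocks, final_norm: Layer, lm_head: Layer):
    x = [list(embed[t]) for t in tokens]  # embedding lookup
    for block in blocks:                  # L transformer blocks
        x = block(x)
    return lm_head(final_norm(x))         # final norm, then LM head
```

Every model in this cheatsheet fits this shape; the sections differ only in what is plugged in for the norm, attention, and FFN callables (and, for GPT-2, in adding a positional embedding at the input).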

This cheatsheet writes out the full forward pass for each model in the order it actually executes, using the exact constants from each model's published config.json. No prose — just equations.


Qwen3-8B

Constants (from config.json · modeling_qwen3.py):

$$d = 4096,\quad L = 36,\quad h = 32,\quad h_{kv} = 8,\quad d_h = 128,\quad d_{\mathrm{ff}} = 12288$$

Input embedding:

$$X^0 = E[T]$$

where $T$ is the sequence of token ids and $E \in \mathbb{R}^{151936 \times 4096}$ is the embedding matrix.

For each layer $\ell = 0, \dots, 35$:

Pre-attention norm:

$$A^\ell = \operatorname{RMSNorm}(X^\ell)$$
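RMSNorm is used throughout without being written out. A minimal sketch, assuming the usual definition $\operatorname{RMSNorm}(x) = g \odot x / \sqrt{\operatorname{mean}(x^2) + \epsilon}$ with a learned per-channel gain $g$; the function name and the default `eps` are assumptions, and the real epsilon comes from each model's config.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], gain: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale x to unit root-mean-square, then apply a per-channel gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps is an assumed default
    return [v * inv_rms * g for v, g in zip(x, gain)]
```

Unlike LayerNorm, there is no mean subtraction and no bias term, so the only learned parameters are the $d$ gains.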

QKV projections:

$$Q = A^\ell W_Q, \qquad K = A^\ell W_K, \qquad V = A^\ell W_V$$

where $W_Q \in \mathbb{R}^{4096 \times 4096}$, $W_K, W_V \in \mathbb{R}^{4096 \times 1024}$ (32 Q heads, 8 KV heads, $d_h = 128$ each).

Per-head QK norm (Qwen3-specific):

$$Q = \operatorname{RMSNorm}_{\mathrm{head}}(Q), \qquad K = \operatorname{RMSNorm}_{\mathrm{head}}(K)$$

Rotary positional encoding applied to Q and K:

$$Q = \operatorname{RoPE}(Q), \qquad K = \operatorname{RoPE}(K)$$
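A sketch of what RoPE does to a single head vector, under stated assumptions: the "rotate-half" layout pairs channel $i$ with channel $i + d_h/2$ (one common layout; some implementations interleave pairs instead), and `base=10000.0` is a placeholder for the model's own `rope_theta`. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def rope(x: Sequence[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate one head vector by position-dependent angles (rotate-half layout)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * base ** (-2.0 * i / d)  # lower channels rotate faster
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]              # the pair rotated together
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair is rotated by an angle proportional to the position, the dot product between a rotated query at position $t$ and a rotated key at position $j$ depends only on $t - j$, which is why the encoding is applied to Q and K but not V.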

GQA head repeat — each of the 8 KV heads is shared by 4 query heads:

$$K = \operatorname{repeat}_4(K), \qquad V = \operatorname{repeat}_4(V)$$
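The head repeat is just an index mapping. A small sketch with string placeholders standing in for the per-head tensors; `repeat_kv` is an illustrative helper, and the consecutive ordering (query head $i$ reads KV head $\lfloor i/4 \rfloor$) is an assumption about the layout.

```python
from typing import List, TypeVar

T = TypeVar("T")


def repeat_kv(kv_heads: List[T], n_rep: int) -> List[T]:
    """Repeat each KV head n_rep times consecutively."""
    return [head for head in kv_heads for _ in range(n_rep)]


h, h_kv, d_h = 32, 8, 128  # Qwen3-8B constants from above
n_rep = h // h_kv          # 4 query heads per KV head
expanded = repeat_kv([f"kv{j}" for j in range(h_kv)], n_rep)
```

After the repeat there are 32 key and value heads to match the 32 query heads, while only 8 of each need to be stored in the KV cache.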

Scaled dot-product attention with causal mask:

$$S = \frac{Q K^\top}{\sqrt{128}} + M_{\mathrm{causal}}$$

$$P = \operatorname{softmax}(S)$$

$$O = P V$$
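For one head, the three attention equations can be sketched on plain Python lists. This is a readability sketch, not an efficient implementation: `causal_attention` is a hypothetical name, and the causal mask is applied by only iterating over keys $0..t$ instead of adding $-\infty$.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head softmax(Q K^T / sqrt(d_h) + M_causal) V."""
    d_h = len(q[0])
    out = []
    for t, q_t in enumerate(q):
        # Causal mask: position t only scores keys 0..t.
        scores = [
            sum(a * b for a, b in zip(q_t, k[j])) / math.sqrt(d_h)
            for j in range(t + 1)
        ]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        p = [x / total for x in w]  # softmax row
        out.append([sum(p[j] * v[j][c] for j in range(t + 1)) for c in range(len(v[0]))])
    return out
```

The first position can only attend to itself, so its output is exactly its own value vector.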

Output projection:

$$\operatorname{Attn}(X^\ell) = O\, W_O, \qquad W_O \in \mathbb{R}^{4096 \times 4096}$$

First residual:

$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$

Pre-MLP norm:

$$M^\ell = \operatorname{RMSNorm}(Z^\ell)$$

SwiGLU MLP:

$$G = M^\ell W_{\mathrm{gate}}, \qquad U = M^\ell W_{\mathrm{up}}, \qquad W_{\mathrm{gate}}, W_{\mathrm{up}} \in \mathbb{R}^{4096 \times 12288}$$

$$H = \operatorname{SiLU}(G) \odot U$$

$$\operatorname{MLP}(Z^\ell) = H\, W_{\mathrm{down}}, \qquad W_{\mathrm{down}} \in \mathbb{R}^{12288 \times 4096}$$
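The SwiGLU block for a single token vector, sketched with nested lists so the three matrices and the elementwise gate are explicit. The helper names (`silu`, `matvec`, `swiglu_mlp`) are illustrative, and weights are stored as rows of shape (d_in, d_out) to match the $x W$ convention used in the equations.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x: Sequence[float], w: Matrix) -> List[float]:
    """Row vector times matrix: x (d_in) @ w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_mlp(m: Sequence[float], w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> List[float]:
    g = matvec(m, w_gate)                        # gate branch
    u = matvec(m, w_up)                          # up branch
    h = [silu(a) * b for a, b in zip(g, u)]      # SiLU(G) elementwise-times U
    return matvec(h, w_down)                     # project back to d
```

With the Qwen3-8B sizes, the three matrices hold $3 \times 4096 \times 12288 = 150{,}994{,}944$ weights per layer.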

Second residual:

$$X^{\ell+1} = Z^\ell + \operatorname{MLP}(Z^\ell)$$

After all 36 layers:

$$X_{\mathrm{final}} = \operatorname{RMSNorm}(X^{36})$$

$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} \in \mathbb{R}^{4096 \times 151936}$$

GPT-2 (1.5B / XL)

Constants (from config.json · modeling_gpt2.py):

$$d = 1600,\quad L = 48,\quad h = 25,\quad d_h = 64,\quad d_{\mathrm{ff}} = 6400$$

Input embedding + positional encoding:

$$X^0 = E[T] + P[t]$$

where $E \in \mathbb{R}^{50257 \times 1600}$ is the token embedding and $P \in \mathbb{R}^{1024 \times 1600}$ is the learned absolute positional embedding indexed by position $t$.

For each layer $\ell = 0, \dots, 47$:

Pre-attention norm (LayerNorm):

$$A^\ell = \operatorname{LayerNorm}(X^\ell)$$

QKV — GPT-2 uses a single fused projection then splits:

$$[Q, K, V] = A^\ell W_{QKV}, \qquad W_{QKV} \in \mathbb{R}^{1600 \times 4800}$$

$$Q, K, V \in \mathbb{R}^{n \times 25 \times 64}$$
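The split-then-reshape step is easy to get wrong by one axis, so here is a sketch for a single token's 4800-wide projected row. It assumes the fused output is laid out as Q, then K, then V, each of width 1600, with heads as consecutive 64-wide slices; `split_qkv` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

d, h, d_h = 1600, 25, 64  # GPT-2 XL constants from above

Heads = List[List[float]]


def split_qkv(row: Sequence[float]) -> Tuple[Heads, Heads, Heads]:
    """Split one fused QKV row into per-head Q, K, V lists."""
    assert len(row) == 3 * d
    q, k, v = row[:d], row[d:2 * d], row[2 * d:]

    def to_heads(x: Sequence[float]) -> Heads:
        return [list(x[i * d_h:(i + 1) * d_h]) for i in range(h)]

    return to_heads(q), to_heads(k), to_heads(v)


# Using channel indices as values makes the layout visible.
q, k, v = split_qkv(list(range(3 * d)))
```

Here `q[1][0]` is channel 64 of the fused row and `k[0][0]` is channel 1600, which shows where each head and each of Q, K, V starts.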

No RoPE, no GQA: standard MHA with 25 heads, $d_h = 64$.

Scaled dot-product attention with causal mask:

$$S = \frac{Q K^\top}{\sqrt{64}} + M_{\mathrm{causal}}$$

$$P = \operatorname{softmax}(S)$$

$$O = P V$$

Output projection:

$$\operatorname{Attn}(X^\ell) = O\, W_O, \qquad W_O \in \mathbb{R}^{1600 \times 1600}$$

First residual:

$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$

Pre-MLP norm (second LayerNorm):

$$M^\ell = \operatorname{LayerNorm}(Z^\ell)$$

GELU MLP (two-matrix, no gating):

$$H = \operatorname{GELU}(M^\ell\, W_1), \qquad W_1 \in \mathbb{R}^{1600 \times 6400}$$

$$\operatorname{MLP}(Z^\ell) = H\, W_2, \qquad W_2 \in \mathbb{R}^{6400 \times 1600}$$

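where GELU uses the tanh approximation: $\operatorname{GELU}(x) \approx \tfrac{1}{2}\,x\,\bigl(1 + \tanh\bigl(\sqrt{2/\pi}\,(x + 0.044715\,x^3)\bigr)\bigr)$. The sigmoid form $x \cdot \sigma(1.702\,x)$ is a different, coarser approximation.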
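Two closed-form approximations of GELU are in common use: the tanh form and the sigmoid form $x \cdot \sigma(1.702\,x)$. A quick sketch comparing both against the exact erf definition, using only the standard library (the function names are illustrative):

```python
import math


def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation: x * sigmoid(1.702 x)."""
    return x / (1.0 + math.exp(-1.702 * x))


for x in (-2.0, -0.5, 0.5, 1.0, 2.0):
    print(x, gelu_exact(x), gelu_tanh(x), gelu_sigmoid(x))
```

At $x = 1$ the tanh form agrees with the exact value to about three decimal places, while the sigmoid form is off in the third decimal place.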

Second residual:

$$X^{\ell+1} = Z^\ell + \operatorname{MLP}(Z^\ell)$$

After all 48 layers:

$$X_{\mathrm{final}} = \operatorname{LayerNorm}(X^{48})$$

$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} = E^\top \in \mathbb{R}^{1600 \times 50257}$$

The LM head is weight-tied to the embedding matrix $E$.
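As a sanity check on the shapes above, the parameter count can be tallied directly from the constants. One assumption goes beyond the equations as written: the GPT-2 linear layers and LayerNorms also carry bias vectors, which the sketch includes. The helper `linear` is illustrative.

```python
d, L, d_ff, vocab, n_pos = 1600, 48, 6400, 50257, 1024


def linear(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of one linear layer (bias terms are an added assumption)."""
    return d_in * d_out + (d_out if bias else 0)


layer_norm = 2 * d  # gain and bias
per_layer = (
    linear(d, 3 * d)    # fused W_QKV
    + linear(d, d)      # W_O
    + linear(d, d_ff)   # W_1
    + linear(d_ff, d)   # W_2
    + 2 * layer_norm    # pre-attention and pre-MLP norms
)
embeddings = vocab * d + n_pos * d                # E and P
total = embeddings + L * per_layer + layer_norm   # tied LM head adds nothing
untied_total = total + vocab * d                  # what an untied head would cost

print(f"{total:,}")  # 1,557,611,200
```

The tied head saves $50257 \times 1600 \approx 80$M parameters, about 5% of the model.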


DeepSeek-V3 (671B MoE)

Constants (from config.json · inference/model.py):

$$d = 7168,\quad L = 61,\quad h = 128,\quad d_h^{\mathrm{nope}} = 128,\quad d_h^{\mathrm{rope}} = 64$$

$$d_c^{KV} = 512,\quad d_c^Q = 1536,\quad E_{\mathrm{routed}} = 256,\quad E_{\mathrm{shared}} = 1,\quad k = 8,\quad d_{\mathrm{exp}} = 2048$$

Input embedding:

$$X^0 = E[T], \qquad E \in \mathbb{R}^{129280 \times 7168}$$

For each layer $\ell = 0, \dots, 60$:

Pre-attention norm:

$$A^\ell = \operatorname{RMSNorm}(X^\ell)$$

MLA — Multi-head Latent Attention. Compress Q and KV into low-rank latents:

$$c^Q = A^\ell W^{DQ}, \qquad c^Q \in \mathbb{R}^{n \times 1536}$$

$$c^{KV} = A^\ell W^{DKV}, \qquad c^{KV} \in \mathbb{R}^{n \times 512}$$

Decompress Q — split into NoPE and RoPE heads:

$$[Q^{\mathrm{nope}}, Q^{\mathrm{rope}}] = c^Q W^{UQ}$$

$$Q^{\mathrm{rope}} = \operatorname{RoPE}(Q^{\mathrm{rope}})$$

$$Q = [Q^{\mathrm{nope}},\; Q^{\mathrm{rope}}]$$

Decompress KV — split into NoPE key/value and RoPE key:

$$[K^{\mathrm{nope}}, V] = c^{KV} W^{UKV}$$

$$K^{\mathrm{rope}} = A^\ell W^{KR}, \qquad K^{\mathrm{rope}} = \operatorname{RoPE}(K^{\mathrm{rope}})$$

$$K = [K^{\mathrm{nope}},\; K^{\mathrm{rope}}]$$

At inference only $c^{KV}$ and $K^{\mathrm{rope}}$ are cached: 512 + 64 = 576 floats per token per layer, vs $128 \times 128 \times 2 = 32768$ for standard MHA.
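The MLA projection shapes and the cache arithmetic follow from the constants. In this sketch the value head dimension `d_v = 128` is an assumption (it is not listed in the constants above), as is the reading that a single 64-wide RoPE key is shared by all 128 heads, which is what the 576-float cache figure implies.

```python
d, h = 7168, 128
d_nope, d_rope = 128, 64
d_v = 128                   # assumption: value head dim, not listed above
d_c_kv, d_c_q = 512, 1536

shapes = {
    "W_DQ": (d, d_c_q),                       # compress queries
    "W_UQ": (d_c_q, h * (d_nope + d_rope)),   # decompress to 128 heads x 192
    "W_DKV": (d, d_c_kv),                     # compress keys and values
    "W_UKV": (d_c_kv, h * (d_nope + d_v)),    # decompress to K_nope and V
    "W_KR": (d, d_rope),                      # one RoPE key shared by all heads
    "W_O": (h * d_v, d),                      # output projection
}

mla_cache = d_c_kv + d_rope   # floats cached per token per layer
mha_cache = 2 * h * d_nope    # K and V for 128 heads of width 128
ratio = mha_cache / mla_cache

for name, shape in shapes.items():
    print(name, shape)
print(mla_cache, mha_cache, round(ratio, 1))
```

Under these assumptions the latent cache is roughly 57 times smaller per token than caching full keys and values.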

Scaled dot-product attention (128 heads):

$$S = \frac{Q K^\top}{\sqrt{192}} + M_{\mathrm{causal}}$$

$$P = \operatorname{softmax}(S), \qquad O = P V$$

$$\operatorname{Attn}(X^\ell) = O\, W_O$$

First residual:

$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$

Pre-FFN norm:

$$M^\ell = \operatorname{RMSNorm}(Z^\ell)$$

FFN — depends on layer depth:

Layers 0–2: dense SwiGLU FFN:

$$\operatorname{FFN}_{\mathrm{dense}}(M^\ell) = \bigl(\operatorname{SiLU}(M^\ell W_{\mathrm{gate}}) \odot M^\ell W_{\mathrm{up}}\bigr) W_{\mathrm{down}}$$

Layers 3–60: MoE FFN with 1 shared + top-8 of 256 routed experts:

$$s_i = \operatorname{sigmoid}(M^\ell w_{r,i}), \qquad i = 1, \dots, 256$$

$$\mathcal{T} = \operatorname{top\text{-}8}(\{s_i\})$$

$$\operatorname{FFN}_{\mathrm{MoE}}(M^\ell) = \operatorname{FFN}_{\mathrm{shared}}(M^\ell) + \sum_{i \in \mathcal{T}} s_i \cdot \operatorname{FFN}_i(M^\ell)$$

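where each expert FFN is a SwiGLU block with $d_{\mathrm{exp}} = 2048$. Scoring uses sigmoid (not softmax); in the reference implementation the selected top-$k$ scores are renormalized to sum to 1 and then multiplied by routed_scaling_factor = 2.5 before they weight the expert outputs.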

Second residual:

$$X^{\ell+1} = Z^\ell + \operatorname{FFN}(M^\ell)$$

After all 61 layers:

$$X_{\mathrm{final}} = \operatorname{RMSNorm}(X^{61})$$

$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} \in \mathbb{R}^{7168 \times 129280}$$

GPT-OSS-20B and GPT-OSS-120B

Constants (from config.json (120B) · modeling_gpt_oss.py):

$$d = 2880,\quad h = 64,\quad h_{kv} = 8,\quad d_h = 64,\quad w = 128$$

$$\text{20B: } L = 24,\; E = 32 \quad\quad \text{120B: } L = 36,\; E = 128$$

Input embedding:

$$X^0 = E[T], \qquad E \in \mathbb{R}^{201088 \times 2880}$$

For each layer $\ell = 0, \dots, L{-}1$:

Pre-attention norm:

$$A^\ell = \operatorname{RMSNorm}(X^\ell)$$

QKV projections with attention bias:

$$Q = A^\ell W_Q + b_Q, \qquad K = A^\ell W_K + b_K, \qquad V = A^\ell W_V + b_V$$

where $W_Q \in \mathbb{R}^{2880 \times 4096}$, $W_K, W_V \in \mathbb{R}^{2880 \times 512}$ (64 Q / 8 KV heads, $d_h = 64$). Note: GPT-OSS uses attention_bias: true, so biases are present in the QKV and output projections.

YaRN-RoPE applied to Q and K ($\theta = 150000$):

$$Q = \operatorname{RoPE}(Q), \qquad K = \operatorname{RoPE}(K)$$

GQA head repeat — each of the 8 KV heads is shared by 8 query heads:

$$K = \operatorname{repeat}_8(K), \qquad V = \operatorname{repeat}_8(V)$$

Attention type alternates by layer:

Even layers ($\ell = 0, 2, 4, \dots$): sliding window attention, window $w = 128$:

$$S = \frac{Q K^\top}{\sqrt{64}} + M_{\mathrm{sliding}}$$

where $M_{\mathrm{sliding}}$ masks out positions $j \le t - 128$ in addition to future positions, so each query sees itself and the 127 tokens before it.

Odd layers ($\ell = 1, 3, 5, \dots$): full causal attention:

$$S = \frac{Q K^\top}{\sqrt{64}} + M_{\mathrm{causal}}$$
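Both masks can come from one helper. This sketch builds the additive mask as nested lists and assumes the window counts the current token plus the $w - 1$ tokens before it; that boundary convention is an assumption worth checking against the implementation you are matching. `attention_mask` is a hypothetical name.

```python
from typing import List, Optional

NEG_INF = float("-inf")


def attention_mask(n: int, window: Optional[int] = None) -> List[List[float]]:
    """Additive mask: 0.0 where query t may attend to key j, -inf elsewhere."""
    mask = []
    for t in range(n):
        row = []
        for j in range(n):
            causal_ok = j <= t                               # no future positions
            window_ok = window is None or j > t - window     # assumed boundary
            row.append(0.0 if causal_ok and window_ok else NEG_INF)
        mask.append(row)
    return mask


causal = attention_mask(4)             # full causal mask
sliding = attention_mask(4, window=2)  # each query sees at most 2 keys
```

With `window=2`, the last query in a length-4 sequence sees only positions 2 and 3, while the full causal mask lets it see all four.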

Both layer types:

$$P = \operatorname{softmax}(S), \qquad O = P V$$

$$\operatorname{Attn}(X^\ell) = O\, W_O + b_O$$

First residual:

$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$

Pre-MLP norm:

$$M^\ell = \operatorname{RMSNorm}(Z^\ell)$$

MoE FFN: the router selects the top-4 of $E$ experts by logit, then applies softmax over the selected four:

$$r_i = M^\ell w_{r,i}, \qquad i = 1, \dots, E$$

$$\mathcal{T} = \operatorname{top\text{-}4}(\{r_i\})$$

$$s_i = \frac{\exp(r_i)}{\sum_{j \in \mathcal{T}} \exp(r_j)}, \qquad i \in \mathcal{T}$$

$$\operatorname{FFN}_{\mathrm{MoE}}(M^\ell) = \sum_{i \in \mathcal{T}} s_i \cdot \operatorname{FFN}_i(M^\ell)$$

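Each expert is a clamped SwiGLU block:

$$G_i = M^\ell W_{\mathrm{gate},i}, \qquad U_i = M^\ell W_{\mathrm{up},i}$$

$$\tilde{G}_i = \min(G_i,\, 7), \qquad \tilde{U}_i = \operatorname{clamp}(U_i,\, -7,\, 7)$$

$$H_i = \bigl(\tilde{G}_i \odot \sigma(1.702\,\tilde{G}_i)\bigr) \odot (\tilde{U}_i + 1)$$

$$\operatorname{FFN}_i(M^\ell) = H_i\, W_{\mathrm{down},i}$$

where the clamp at 7 is the swiglu_limit parameter, a training stability constraint specific to GPT-OSS and not present in the other models. The clamp acts on the pre-activations, the gate uses the sigmoid form with slope 1.702 rather than plain SiLU, and the up branch is offset by 1 (as in modeling_gpt_oss.py; expert biases are omitted here).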

Second residual:

$$X^{\ell+1} = Z^\ell + \operatorname{FFN}_{\mathrm{MoE}}(M^\ell)$$

After all layers:

$$X_{\mathrm{final}} = \operatorname{RMSNorm}(X^L)$$

$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} \in \mathbb{R}^{2880 \times 201088}$$

Cite this work


Roy, Swastik. (2025). Cheatsheet: LLM Forward Pass Equations. S. Roy. https://swastikroy.me/blog/cheatsheet-forward-pass
