The forward pass of a language model is a fixed sequence of matrix multiplications, normalizations, and nonlinearities applied to a token sequence. The high-level structure is identical across all modern decoder-only models (embedding → $L$ transformer blocks → final norm → LM head), but the internals of each block differ in exactly the ways that matter: which attention mechanism, which FFN, how many experts, how the position encoding is applied.
This cheatsheet writes out the full forward pass for each model in the order it actually executes, using the exact constants from each model's published config.json. No prose — just equations.
Qwen3-8B constants (from config.json · modeling_qwen3.py):
$$d = 4096,\quad L = 36,\quad h = 32,\quad h_{kv} = 8,\quad d_h = 128,\quad d_{\mathrm{ff}} = 12288$$
Input embedding:
$$X^0 = E[T]$$
where $T$ is the sequence of token ids and $E \in \mathbb{R}^{151936 \times 4096}$ is the embedding matrix.
For each layer $\ell = 0, \dots, 35$:
Pre-attention norm:
$$A^\ell = \operatorname{RMSNorm}(X^\ell)$$
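For reference, RMSNorm on a single token vector can be sketched as below. This is a minimal illustration in plain Python lists, not the model's implementation; the `eps` value and the helper name `rms_norm` are assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight.

    eps is an assumed stabiliser; check the model's config for the real value.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

x = [3.0, -4.0, 0.0, 0.0]           # toy vector; RMS = sqrt(25 / 4) = 2.5
y = rms_norm(x, weight=[1.0] * 4)   # roughly [1.2, -1.6, 0.0, 0.0]
```

Unlike LayerNorm, no mean is subtracted, so the output keeps the sign pattern and direction of the input.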
QKV projections:
$$Q = A^\ell W_Q, \qquad K = A^\ell W_K, \qquad V = A^\ell W_V$$
where $W_Q \in \mathbb{R}^{4096 \times 4096}$, $W_K, W_V \in \mathbb{R}^{4096 \times 1024}$ (32 Q heads, 8 KV heads, $d_h = 128$ each).
Per-head QK norm (Qwen3-specific):
$$Q = \operatorname{RMSNorm}_{\mathrm{head}}(Q), \qquad K = \operatorname{RMSNorm}_{\mathrm{head}}(K)$$
Rotary positional encoding applied to Q and K:
$$Q = \operatorname{RoPE}(Q), \qquad K = \operatorname{RoPE}(K)$$
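The RoPE step can be illustrated on a single head vector. The sketch below assumes the "rotate-half" pairing (dimension $i$ paired with $i + d_h/2$) and a placeholder base `theta = 10000.0`; neither is taken from the constants above, so treat both as assumptions.

```python
import math

def rope(x, pos, theta=10000.0):
    """Rotate one head vector x (even length) by position-dependent angles.

    Pairs dimension i with i + d/2 ("rotate-half" layout, an assumption here).
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (3) at two different absolute positions.
s_near = dot(rope(q, 5), rope(k, 2))
s_far = dot(rope(q, 13), rope(k, 10))
```

The two scores agree up to rounding, which is the property that makes RoPE a relative position encoding: only the offset between query and key positions enters $Q K^\top$.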
GQA head repeat — each of the 8 KV heads is shared by 4 query heads:
$$K = \operatorname{repeat}_4(K), \qquad V = \operatorname{repeat}_4(V)$$
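The head repeat is just an index mapping. A small sketch, using string labels in place of real tensors and a hypothetical helper `repeat_kv`:

```python
def repeat_kv(kv_heads, n_rep):
    """Repeat each KV head n_rep times consecutively (interleaved repeat)."""
    return [head for head in kv_heads for _ in range(n_rep)]

kv = [f"kv{j}" for j in range(8)]     # 8 KV heads
expanded = repeat_kv(kv, 4)           # 32 entries, one per query head

# Query head i reads KV head i // 4.
mapping = {i: i // 4 for i in range(32)}
```

Because the repeat is consecutive, query heads 0 to 3 share KV head 0, heads 4 to 7 share KV head 1, and so on.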
Scaled dot-product attention with causal mask:
$$S = \frac{Q K^\top}{\sqrt{128}} + M_{\mathrm{causal}}$$
$$P = \operatorname{softmax}(S)$$
$$O = P V$$
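A single-head version of these three equations, written with plain lists at toy sizes. Adding $-\infty$ via $M_{\mathrm{causal}}$ is implemented here by simply not scoring positions $j > t$, which is equivalent; the helper names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention. Q, K are n x d_h lists, V is n x d_v."""
    n, d_h = len(Q), len(Q[0])
    out = []
    for t in range(n):
        # Only positions j <= t are visible to query t.
        scores = [dot / math.sqrt(d_h)
                  for dot in (sum(a * b for a, b in zip(Q[t], K[j]))
                              for j in range(t + 1))]
        p = softmax(scores)
        out.append([sum(p[j] * V[j][c] for j in range(t + 1))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
O = causal_attention(Q, K, V)
```

The first output row equals `V[0]`, since position 0 can only attend to itself, and changing `V[2]` leaves rows 0 and 1 untouched.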
Output projection:
$$\operatorname{Attn}(X^\ell) = O\, W_O, \qquad W_O \in \mathbb{R}^{4096 \times 4096}$$
First residual:
$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$
Pre-MLP norm:
$$M^\ell = \operatorname{RMSNorm}(Z^\ell)$$
SwiGLU MLP:
$$G = M^\ell W_{\mathrm{gate}}, \qquad U = M^\ell W_{\mathrm{up}}, \qquad W_{\mathrm{gate}}, W_{\mathrm{up}} \in \mathbb{R}^{4096 \times 12288}$$
$$H = \operatorname{SiLU}(G) \odot U$$
$$\operatorname{MLP}(Z^\ell) = H\, W_{\mathrm{down}}, \qquad W_{\mathrm{down}} \in \mathbb{R}^{12288 \times 4096}$$
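The SwiGLU block for one token, at toy sizes ($d = 2$, $d_{\mathrm{ff}} = 3$ standing in for 4096 and 12288). The weights are made-up values and `matvec` is a hypothetical helper; this is a sketch of the equations, not the model code.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(x, W):
    """Row vector x (length d_in) times W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

def swiglu_mlp(m, W_gate, W_up, W_down):
    g = matvec(m, W_gate)                        # G = M W_gate
    u = matvec(m, W_up)                          # U = M W_up
    h = [silu(a) * b for a, b in zip(g, u)]      # H = SiLU(G) * U, element-wise
    return matvec(h, W_down)                     # H W_down

m = [1.0, -1.0]
W_gate = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
W_up = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = swiglu_mlp(m, W_gate, W_up, W_down)

# Weight count of the three matrices at the real sizes.
mlp_params = 3 * 4096 * 12288
```

At the real dimensions the three matrices hold `mlp_params` = 150,994,944 weights per layer.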
Second residual:
$$X^{\ell+1} = Z^\ell + \operatorname{MLP}(Z^\ell)$$
After all 36 layers:
$$X_{\mathrm{final}} = \operatorname{RMSNorm}(X^{36})$$
$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} \in \mathbb{R}^{4096 \times 151936}$$
Qwen3-8B notes
Per-head QK norm: a Qwen3 addition not present in Llama or GPT-2. Each query and key vector is RMSNorm'd per head before RoPE is applied, which stabilises the attention logit scale across layers without attention softmax temperature tuning.
GQA ratio 4:1: 32 query heads share 8 KV heads, reducing the KV cache per layer from $2 \times \mathrm{seq} \times 32 \times 128$ to $2 \times \mathrm{seq} \times 8 \times 128$ elements (4× savings).
SwiGLU intermediate dim: 12288 is exactly $3 \times d$, somewhat more than the $\frac{8}{3} d \approx 2.67 \times d \approx 10923$ that a strict parameter-budget match against a two-matrix $4d$ FFN would give; it is also a multiple of 256, which suits hardware efficiency.
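To put the GQA saving in concrete terms, the cache size can be computed directly from the constants. The context length of 32768 tokens and the 2 bytes per element are assumptions chosen for illustration.

```python
def kv_cache_elements(seq_len, n_kv_heads, head_dim, n_layers):
    """Elements stored for K and V across all layers."""
    return 2 * seq_len * n_kv_heads * head_dim * n_layers

seq = 32768                                   # hypothetical context length
gqa = kv_cache_elements(seq, 8, 128, 36)      # 8 KV heads (as configured)
mha = kv_cache_elements(seq, 32, 128, 36)     # if all 32 heads were cached

savings = mha // gqa
gib_gqa = gqa * 2 / 2**30                     # assuming 2 bytes per element
```

Under these assumptions the GQA cache comes to 4.5 GiB for one sequence, against 18 GiB without head sharing.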
GPT-2 constants (XL configuration; from config.json · modeling_gpt2.py):
$$d = 1600,\quad L = 48,\quad h = 25,\quad d_h = 64,\quad d_{\mathrm{ff}} = 6400$$
Input embedding + positional encoding:
$$X^0 = E[T] + P[t]$$
where $E \in \mathbb{R}^{50257 \times 1600}$ is the token embedding and $P \in \mathbb{R}^{1024 \times 1600}$ is the learned absolute positional embedding indexed by position $t$.
For each layer $\ell = 0, \dots, 47$:
Pre-attention norm (LayerNorm):
$$A^\ell = \operatorname{LayerNorm}(X^\ell)$$
QKV (GPT-2 uses a single fused projection, then splits):
$$[Q, K, V] = A^\ell W_{QKV}, \qquad W_{QKV} \in \mathbb{R}^{1600 \times 4800}$$
$$Q, K, V \in \mathbb{R}^{n \times 25 \times 64}$$
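The split of the fused projection can be sketched for one token as follows. The input is a stand-in list of 4800 integers so the slicing is easy to follow; `split_qkv` is a hypothetical helper, and the Q, K, V ordering of the three chunks is assumed.

```python
def split_qkv(fused, d_model, n_heads):
    """Split one token's fused projection (length 3 * d_model) into per-head Q, K, V."""
    d_h = d_model // n_heads

    def to_heads(vec):
        return [vec[h * d_h:(h + 1) * d_h] for h in range(n_heads)]

    q = fused[0:d_model]
    k = fused[d_model:2 * d_model]
    v = fused[2 * d_model:3 * d_model]
    return to_heads(q), to_heads(k), to_heads(v)

fused = list(range(4800))      # stand-in for one row of A W_QKV
Q, K, V = split_qkv(fused, d_model=1600, n_heads=25)
```

Each of `Q`, `K`, `V` ends up as 25 heads of 64 values, matching the $n \times 25 \times 64$ shape above for a single position.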
No RoPE, no GQA: standard MHA with 25 heads, $d_h = 64$.
Scaled dot-product attention with causal mask:
$$S = \frac{Q K^\top}{\sqrt{64}} + M_{\mathrm{causal}}$$
$$P = \operatorname{softmax}(S)$$
$$O = P V$$
Output projection:
$$\operatorname{Attn}(X^\ell) = O\, W_O, \qquad W_O \in \mathbb{R}^{1600 \times 1600}$$
First residual:
$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$
Pre-MLP norm (second LayerNorm):
$$M^\ell = \operatorname{LayerNorm}(Z^\ell)$$
GELU MLP (two-matrix, no gating):
$$H = \operatorname{GELU}(M^\ell\, W_1), \qquad W_1 \in \mathbb{R}^{1600 \times 6400}$$
$$\operatorname{MLP}(Z^\ell) = H\, W_2, \qquad W_2 \in \mathbb{R}^{6400 \times 1600}$$
where GELU uses the tanh approximation: $\operatorname{GELU}(x) \approx \tfrac{1}{2}\, x \left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715\, x^3\bigr)\right]\right)$.
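Two GELU approximations are easy to confuse: the tanh form and the cheaper sigmoid form $x \cdot \sigma(1.702\,x)$. The sketch below compares both against the exact erf-based definition on a grid; the grid range and the function names are arbitrary choices for illustration.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_sigmoid(x):
    return x / (1.0 + math.exp(-1.702 * x))

grid = [i / 100.0 for i in range(-500, 501)]   # x in [-5, 5]
err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)
```

On this grid the tanh form tracks the exact function noticeably more closely than the sigmoid form, so the two should not be treated as interchangeable when reproducing logits.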
Second residual:
$$X^{\ell+1} = Z^\ell + \operatorname{MLP}(Z^\ell)$$
After all 48 layers:
$$X_{\mathrm{final}} = \operatorname{LayerNorm}(X^{48})$$
$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} = E^\top \in \mathbb{R}^{1600 \times 50257}$$
The LM head is weight-tied to the embedding matrix $E$.
GPT-2 notes
Absolute positional embeddings: unlike RoPE, $P[t]$ adds position information once at the input, with no per-layer rotation. This caps the usable context window at the training length (1024).
Two-matrix FFN: GPT-2 uses $W_1, W_2$ without a gate branch; SwiGLU replaces this with three matrices ($W_{\mathrm{gate}}, W_{\mathrm{up}}, W_{\mathrm{down}}$).
No GQA: 25 full Q/K/V heads means the KV cache is $2 \times \mathrm{seq} \times 25 \times 64$ elements per layer.
LayerNorm, not RMSNorm: GPT-2 subtracts the mean and divides by the standard deviation, with learned $\gamma, \beta$; RMSNorm (used by the other models here) drops the mean subtraction and the bias $\beta$.
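The LayerNorm versus RMSNorm difference shows up clearly when the input is shifted by a constant. A small sketch with plain lists; the `eps` value is an assumption, and $\gamma = 1$, $\beta = 0$ are used to isolate the normalisation itself.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
shifted = [v + 10.0 for v in x]
ones, zeros = [1.0] * 4, [0.0] * 4

ln, ln_shifted = layer_norm(x, ones, zeros), layer_norm(shifted, ones, zeros)
rn, rn_shifted = rms_norm(x, ones), rms_norm(shifted, ones)
```

LayerNorm gives the same output for `x` and `shifted` because the mean is removed first; RMSNorm does not, since the shift changes the root-mean-square.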
DeepSeek-V3 constants (from config.json · inference/model.py):
$$d = 7168,\quad L = 61,\quad h = 128,\quad d_h^{\mathrm{nope}} = 128,\quad d_h^{\mathrm{rope}} = 64$$
$$d_c^{KV} = 512,\quad d_c^Q = 1536,\quad E_{\mathrm{routed}} = 256,\quad E_{\mathrm{shared}} = 1,\quad k = 8,\quad d_{\mathrm{exp}} = 2048$$
Input embedding:
$$X^0 = E[T], \qquad E \in \mathbb{R}^{129280 \times 7168}$$
For each layer $\ell = 0, \dots, 60$:
Pre-attention norm:
$$A^\ell = \operatorname{RMSNorm}(X^\ell)$$
MLA (Multi-head Latent Attention). Compress Q and KV into low-rank latents:
$$c^Q = A^\ell W^{DQ}, \qquad c^Q \in \mathbb{R}^{n \times 1536}$$
$$c^{KV} = A^\ell W^{DKV}, \qquad c^{KV} \in \mathbb{R}^{n \times 512}$$
Decompress Q, split into NoPE and RoPE heads:
$$[Q^{\mathrm{nope}}, Q^{\mathrm{rope}}] = c^Q W^{UQ}$$
$$Q^{\mathrm{rope}} = \operatorname{RoPE}(Q^{\mathrm{rope}})$$
$$Q = [Q^{\mathrm{nope}},\; Q^{\mathrm{rope}}]$$
Decompress KV, split into NoPE key/value and RoPE key:
$$[K^{\mathrm{nope}}, V] = c^{KV} W^{UKV}$$
$$K^{\mathrm{rope}} = A^\ell W^{KR}, \qquad K^{\mathrm{rope}} = \operatorname{RoPE}(K^{\mathrm{rope}})$$
$$K = [K^{\mathrm{nope}},\; K^{\mathrm{rope}}]$$
At inference only $c^{KV}$ and $K^{\mathrm{rope}}$ are cached: 512 + 64 = 576 floats per token per layer, vs $128 \times 128 \times 2 = 32768$ for standard MHA.
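The cache arithmetic can be checked in a few lines. The 2 bytes per element used for the per-token total is an assumption about storage precision, not something stated in the config; the function names are illustrative.

```python
def mla_cache_per_token(d_c_kv=512, d_rope=64):
    """Cached elements per token per layer under MLA: latent plus shared RoPE key."""
    return d_c_kv + d_rope

def mha_cache_per_token(n_heads=128, head_dim=128):
    """Cached elements per token per layer if full K and V were stored."""
    return 2 * n_heads * head_dim

mla = mla_cache_per_token()
mha = mha_cache_per_token()
ratio = mha / mla

# Bytes per token over all 61 layers, assuming 2 bytes per element.
mla_bytes_per_token = mla * 61 * 2
```

The ratio works out to just under 57, and under the stated assumption a token costs about 70 KB of cache across the whole stack.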
Scaled dot-product attention (128 heads):
$$S = \frac{Q K^\top}{\sqrt{192}} + M_{\mathrm{causal}}$$
$$P = \operatorname{softmax}(S), \qquad O = P V$$
$$\operatorname{Attn}(X^\ell) = O\, W_O$$
First residual:
$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$
Pre-FFN norm:
$$M^\ell = \operatorname{RMSNorm}(Z^\ell)$$
FFN (depends on layer depth):
Layers 0–2: dense SwiGLU FFN:
$$\operatorname{FFN}_{\mathrm{dense}}(M^\ell) = \bigl(\operatorname{SiLU}(M^\ell W_{\mathrm{gate}}) \odot M^\ell W_{\mathrm{up}}\bigr) W_{\mathrm{down}}$$
Layers 3–60: MoE FFN with 1 shared + top-8 of 256 routed experts:
$$s_i = \operatorname{sigmoid}(M^\ell w_{r,i}), \qquad i = 1, \dots, 256$$
$$\mathcal{T} = \operatorname{top\text{-}8}(\{s_i\})$$
$$\operatorname{FFN}_{\mathrm{MoE}}(M^\ell) = \operatorname{FFN}_{\mathrm{shared}}(M^\ell) + \sum_{i \in \mathcal{T}} s_i \cdot \operatorname{FFN}_i(M^\ell)$$
where each expert FFN is a SwiGLU block with $d_{\mathrm{exp}} = 2048$. Scoring uses sigmoid (not softmax); the selected top-$k$ scores are renormalised to sum to 1 and multiplied by routed_scaling_factor = 2.5 before weighting the expert outputs.
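A sketch of the routing step for one token. It assumes the selected sigmoid scores are renormalised to sum to 1 before the 2.5 scaling, and it leaves out the load-balancing bias and any expert grouping; the router logits are made-up numbers and `route` is a hypothetical helper.

```python
import math

def route(logits, k=8, scale=2.5):
    """Sigmoid scoring, top-k selection, renormalise, then scale.

    Returns {expert_index: weight} for the k selected experts.
    The renormalisation step is an assumption of this sketch.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scale * scores[i] / total for i in top}

# 256 made-up router logits; the largest value recurs at i % 17 == 16.
logits = [0.1 * (i % 17) - 0.5 for i in range(256)]
weights = route(logits)
```

The returned weights are what multiply each selected expert's output; the shared expert is added on top with weight 1, as in the sum above.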
Second residual:
$$X^{\ell+1} = Z^\ell + \operatorname{FFN}(M^\ell)$$
After all 61 layers:
$$X_{\mathrm{final}} = \operatorname{RMSNorm}(X^{61})$$
$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} \in \mathbb{R}^{7168 \times 129280}$$
DeepSeek-V3 notes
MLA KV cache math: standard MHA would cache $K, V \in \mathbb{R}^{n \times 128 \times 128}$ per layer, i.e. $32768n$ elements. MLA caches $c^{KV} \in \mathbb{R}^{n \times 512}$ and $K^{\mathrm{rope}} \in \mathbb{R}^{n \times 64}$, i.e. $576n$ elements: roughly a 57× reduction ($32768 / 576 \approx 56.9$).
Sigmoid router: unlike most MoE models, which use softmax, DeepSeek-V3 uses sigmoid scores (scoring_func: "sigmoid"), so scores are independent across experts.
Aux-loss-free load balancing: per-expert bias terms in the router are adjusted dynamically instead of adding an auxiliary loss to the training objective.
MTP: a separate multi-token prediction module predicts tokens beyond the next one ($t{+}2, \ldots$) as an auxiliary training signal; it can be discarded at inference.
GPT-OSS constants (from config.json (120B) · modeling_gpt_oss.py):
$$d = 2880,\quad h = 64,\quad h_{kv} = 8,\quad d_h = 64,\quad w = 128$$
$$\text{20B: } L = 24,\; E = 32 \quad\quad \text{120B: } L = 36,\; E = 128$$
Input embedding:
$$X^0 = E[T], \qquad E \in \mathbb{R}^{201088 \times 2880}$$
For each layer $\ell = 0, \dots, L{-}1$:
Pre-attention norm:
$$A^\ell = \operatorname{RMSNorm}(X^\ell)$$
QKV projections with attention bias:
$$Q = A^\ell W_Q + b_Q, \qquad K = A^\ell W_K + b_K, \qquad V = A^\ell W_V + b_V$$
where $W_Q \in \mathbb{R}^{2880 \times 4096}$, $W_K, W_V \in \mathbb{R}^{2880 \times 512}$ (64 Q / 8 KV heads, $d_h = 64$). Note: GPT-OSS uses attention_bias: true, so biases are present in the QKV and output projections.
YaRN-RoPE applied to Q and K ($\theta = 150000$):
$$Q = \operatorname{RoPE}(Q), \qquad K = \operatorname{RoPE}(K)$$
GQA head repeat (each of the 8 KV heads is shared by 8 query heads):
$$K = \operatorname{repeat}_8(K), \qquad V = \operatorname{repeat}_8(V)$$
Attention type alternates by layer:
Even layers ($\ell = 0, 2, 4, \dots$): sliding window attention, window $w = 128$:
$$S = \frac{Q K^\top}{\sqrt{64}} + M_{\mathrm{sliding}}$$
where $M_{\mathrm{sliding}}$ masks out positions $j \le t - 128$ in addition to future positions.
Odd layers ($\ell = 1, 3, 5, \dots$): full causal attention:
$$S = \frac{Q K^\top}{\sqrt{64}} + M_{\mathrm{causal}}$$
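Both mask types can be generated from one helper. The sketch uses booleans (True means "may attend") instead of additive $-\infty$ entries, and a toy window of 3 in place of 128. It assumes the window counts the current token, so exactly `window` positions stay visible; implementations differ by one on this convention.

```python
def attention_mask(n, window=None):
    """Boolean mask[t][j]: True where query t may attend to key j.

    window=None gives the plain causal mask. Otherwise only the most recent
    `window` positions, including t itself, stay visible (assumed convention).
    """
    return [[j <= t and (window is None or t - j < window) for j in range(n)]
            for t in range(n)]

full = attention_mask(6)
sliding = attention_mask(6, window=3)   # toy window; the text uses w = 128
```

In `sliding`, the last row sees only positions 3, 4 and 5, while in `full` it sees all six. For sequences shorter than the window the two masks coincide.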
Both layer types:
$$P = \operatorname{softmax}(S), \qquad O = P V$$
$$\operatorname{Attn}(X^\ell) = O\, W_O + b_O$$
First residual:
$$Z^\ell = X^\ell + \operatorname{Attn}(X^\ell)$$
Pre-MLP norm:
$$M^\ell = \operatorname{RMSNorm}(Z^\ell)$$
MoE FFN: the router selects the top-4 of $E$ experts, then applies softmax over the selected logits:
$$r_i = M^\ell w_{r,i}, \qquad i = 1, \dots, E$$
$$\mathcal{T} = \operatorname{top\text{-}4}(\{r_i\})$$
$$s_i = \frac{\exp(r_i)}{\sum_{j \in \mathcal{T}} \exp(r_j)}, \qquad i \in \mathcal{T}$$
$$\operatorname{FFN}_{\mathrm{MoE}}(M^\ell) = \sum_{i \in \mathcal{T}} s_i \cdot \operatorname{FFN}_i(M^\ell)$$
Each expert is a clamped SwiGLU block:
$$G_i = M^\ell W_{\mathrm{gate},i}, \qquad U_i = M^\ell W_{\mathrm{up},i}$$
$$\tilde{G}_i = \min(G_i,\; 7), \qquad \tilde{U}_i = \operatorname{clamp}(U_i,\; -7,\; 7)$$
$$H_i = \bigl(\tilde{G}_i \odot \sigma(1.702\, \tilde{G}_i)\bigr) \odot (\tilde{U}_i + 1)$$
$$\operatorname{FFN}_i(M^\ell) = H_i\, W_{\mathrm{down},i}$$
where the clamp at $\pm 7$ is the swiglu_limit parameter, a training stability constraint specific to GPT-OSS and not present in the other models here. In modeling_gpt_oss.py the clamp acts on the pre-activations (upper bound only on the gate branch), the gate uses $x\,\sigma(1.702\,x)$ rather than plain SiLU, and the linear branch is offset by 1.
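A sketch of one reading of the clamped activation, applied element-wise to toy inputs. It assumes the clamp is applied to the pre-activations (upper bound only on the gate), a sigmoid gate with slope 1.702, and a +1 offset on the linear branch. All three are assumptions of this sketch and should be checked against modeling_gpt_oss.py.

```python
import math

def clamped_swiglu(gate, up, limit=7.0, alpha=1.702):
    """Element-wise gated activation with pre-activation clamping (assumed form)."""
    out = []
    for g, u in zip(gate, up):
        g = min(g, limit)                          # gate: upper clamp only
        u = max(-limit, min(u, limit))             # linear branch: [-limit, limit]
        glu = g / (1.0 + math.exp(-alpha * g))     # g * sigmoid(alpha * g)
        out.append(glu * (u + 1.0))
    return out

# Extreme pre-activations are bounded by the clamp.
h = clamped_swiglu([0.5, 50.0, -50.0], [2.0, 100.0, -100.0])
h_at_limit = clamped_swiglu([0.5, 7.0, -50.0], [2.0, 7.0, -7.0])
```

With these settings no output can exceed $7 \times (7 + 1) = 56$, however large the pre-activations get, which is the stabilising effect the limit is meant to provide.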
Second residual:
$$X^{\ell+1} = Z^\ell + \operatorname{FFN}_{\mathrm{MoE}}(M^\ell)$$
After all layers:
$$X_{\mathrm{final}} = \operatorname{RMSNorm}(X^L)$$
$$\mathrm{logits} = X_{\mathrm{final}}\, W_{\mathrm{lm}}, \qquad W_{\mathrm{lm}} \in \mathbb{R}^{2880 \times 201088}$$
GPT-OSS notes
Alternating attention: the layer_types list in config.json encodes exactly which layers are "sliding_attention" vs "full_attention", alternating from layer 0. This is structurally different from other MoE models: the attention pattern itself varies per layer, not just the FFN.
swiglu_limit: 7.0 caps the magnitude of the values entering each expert's element-wise product at 7. Not present in Qwen3 or DeepSeek.
attention_bias: true means GPT-OSS includes bias terms in the Q, K, V, and O projections; most models since LLaMA have set attention_bias: false.
Active params: with top-4 of 32 experts (20B) or 128 experts (120B), active params per token are ~3.6B and ~5.1B respectively.