S. Roy

Blog Post

Comparing Large Model Architectures: Attention, Normalization, and Scale

GPT-4, Gemini, LLaMA, Mistral, DeepSeek, Qwen — they all build on the same transformer skeleton. But the architectural choices diverge sharply. Here's a systematic comparison across model families.

Architecture choices aren't cosmetic. The gap between multi-head attention and grouped-query attention is a 4–8× reduction in the KV cache you have to hold in memory at inference. The gap between a dense model and a mixture-of-experts is an active parameter count that runs from roughly 5% to 30% of the total in the models below: the per-token FLOPs of a small model, with a much larger model behind them. The gap between rotary positions and learned absolute positions is whether the model can run on a context longer than the one it was trained on at all. Strip away the press releases and the frontier model families differ along a small number of axes, each of which buys a concrete property. Here is how they actually line up, taken from the technical reports.

| Model | Attention | Norm | Pos. encoding | FFN | Context | Active params | Architecture notes |
|---|---|---|---|---|---|---|---|
| GPT-2 | MHA | Pre-LN | Learned absolute | GeLU | 1024 | 117M–1.5B | Standard decoder, no novelties |
| GPT-3 | MHA | Pre-LN | Learned absolute | GeLU | 2048 | 175B | GPT-2 at 100× scale |
| PaLM | MHA | Pre-LN (RMSNorm) | RoPE | SwiGLU | 2048 | 540B | Parallel attn+FFN blocks |
| LLaMA 1 | MHA | Pre-LN (RMSNorm) | RoPE | SwiGLU | 2048 | 7B–65B | First frontier-quality open model |
| LLaMA 2 | GQA (70B only) | Pre-LN (RMSNorm) | RoPE | SwiGLU | 4096 | 7B–70B | GQA only on the 70B |
| LLaMA 3 | GQA | Pre-LN (RMSNorm) | RoPE | SwiGLU | 8192 | 8B–70B | GQA across all sizes |
| Mistral 7B | GQA | Pre-LN (RMSNorm) | RoPE | SwiGLU | 8192 (SWA) | 7B | Sliding window attention |
| Mixtral 8×7B | GQA + MoE | Pre-LN (RMSNorm) | RoPE | SwiGLU, 8 experts top-2 | 32768 | ~13B of 47B | Sparse MoE, top-2 routing |
| Gemini 1.5 | MQA (unconfirmed) | Pre-LN (RMSNorm) | RoPE | SwiGLU + MoE | 1M | Unknown | Natively multimodal, 1M ctx; attention not disclosed |
| DeepSeek-V2 | MLA | Pre-LN (RMSNorm) | RoPE | SwiGLU + MoE | 128k | ~21B of 236B | Multi-head latent attention |
| DeepSeek-V3 | MLA | Pre-LN (RMSNorm) | RoPE | SwiGLU + MoE | 128k | ~37B of 671B | Fine-grained MoE + MLA |
| Qwen2.5 | GQA | Pre-LN (RMSNorm) | RoPE | SwiGLU | 128k | 0.5B–72B | Conservative, well-tuned |

The table reads as convergence with a few deliberate departures. Three columns have collapsed onto a single answer since roughly 2023: normalization, position encoding, and the FFN nonlinearity are effectively settled. Attention has converged on a narrow family. The interesting variation lives in three places: how the KV cache is compressed, whether the FFN is dense or sparse, and how far the context stretches.

Attention: the convergence on GQA, and the one real departure

Multi-head attention was the original design, and it has a memory problem that only shows up at serving time. The attention variants post walks through the mechanism in detail; the short version is that the KV cache — the keys and values you must retain for every past token so you don't recompute them — scales with the number of attention heads. At long context and large batch, that cache, not the parameters, is what fills the accelerator.

Multi-query attention (Shazeer, 2019) attacked this by collapsing all query heads onto a single shared key/value head, cutting the cache by the head count. It worked, but it cut too far: a single KV head measurably degraded quality on harder tasks. Grouped-query attention (Ainslie et al., 2023) is the stable middle. Query heads are partitioned into groups, each group sharing one KV head — typically 8 KV heads for 64 query heads — recovering most of MHA's quality at a fraction of the cache. That balance is why GQA, not MQA, is the default across LLaMA 2/3, Mistral, and Qwen.
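The cache arithmetic behind that trade-off is easy to check by hand. The sketch below is a back-of-the-envelope calculator, not any library's API: the function names and the 80-layer, 64-head, 128-wide configuration are assumptions chosen for illustration, and the cache is assumed to be stored in 16-bit floats.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads under GQA."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


# Hypothetical configuration: 80 layers, head width 128, 8192-token context.
cfg = dict(n_layers=80, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(n_kv_heads=64, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 8 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one KV head

print(f"MHA: {mha / 2**30:.1f} GiB")                          # MHA: 20.0 GiB
print(f"GQA: {gqa / 2**30:.1f} GiB ({mha // gqa}x smaller)")  # GQA: 2.5 GiB (8x smaller)
print(f"MQA: {mqa / 2**30:.1f} GiB ({mha // mqa}x smaller)")  # MQA: 0.3 GiB (64x smaller)
```

The reduction factor is simply the ratio of query heads to KV heads, which is why the 64-to-8 grouping quoted above gives 8×.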

DeepSeek-V2's multi-head latent attention is the one genuinely novel move in the column. Instead of storing keys and values per head, it caches a single low-rank latent and reconstructs the per-head keys and values from it on the fly. The latent for token $t$ is a down-projection of that token's hidden state,

$$
\mathbf{c}^{KV}_t = W^{DKV}\, \mathbf{h}_t, \qquad \mathbf{c}^{KV}_t \in \mathbb{R}^{d_c}, \quad d_c \ll n_h \cdot d_h,
$$

where $W^{DKV}$ is the down-projection and $d_c$ is the latent dimension, far smaller than the combined per-head key/value width $n_h \cdot d_h$. At attention time the per-head keys and values are recovered by up-projecting $\mathbf{c}^{KV}_t$, so only the latent has to live in the cache. The cache cost per token drops from $2 n_h d_h$ to roughly $d_c$ (a 5–13× reduction relative to MHA), and unlike MQA it pays for that compression with almost no quality loss, because each head still gets its own reconstructed projection rather than sharing one.
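A toy version of the latent cache makes the bookkeeping concrete. This is a minimal sketch under stated assumptions, not DeepSeek's implementation: the weights are random stand-ins, the sizes are tiny, the up-projection names `W_uk` and `W_uv` are hypothetical, and the separate handling that RoPE needs in the real design is left out.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for illustration.
d_model, n_h, d_h, d_c = 32, 4, 8, 6
rng = random.Random(0)

W_dkv = rand_matrix(d_c, d_model, rng)   # down-projection W^{DKV}
W_uk = rand_matrix(n_h * d_h, d_c, rng)  # key up-projection (hypothetical name)
W_uv = rand_matrix(n_h * d_h, d_c, rng)  # value up-projection (hypothetical name)

latent_cache = []  # one d_c-dimensional latent per past token, nothing else


def cache_token(h_t):
    """Store only the compressed latent c_t = W_dkv @ h_t."""
    latent_cache.append(matvec(W_dkv, h_t))


def split_heads(x):
    """Cut a flat vector of width n_h * d_h into n_h per-head vectors."""
    return [x[i * d_h:(i + 1) * d_h] for i in range(n_h)]


def keys_and_values(t):
    """Rebuild per-head keys and values for token t from its cached latent."""
    c_t = latent_cache[t]
    return split_heads(matvec(W_uk, c_t)), split_heads(matvec(W_uv, c_t))


for _ in range(3):
    cache_token([rng.gauss(0.0, 1.0) for _ in range(d_model)])

keys, values = keys_and_values(0)
mha_floats = 2 * n_h * d_h         # what MHA would cache per token
mla_floats = len(latent_cache[0])  # what the latent cache holds per token
print(mha_floats, mla_floats)      # 64 6
```

Each head still receives its own key and value vector, because the up-projection has a distinct block of rows per head. That is the property the paragraph above credits for the small quality loss.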

Normalization: uniform Pre-LN with RMSNorm since 2023

There is no debate left in this column. Every frontier model trained after 2022 applies normalization to the input of each sublayer, inside the residual branch (Pre-LN), and uses RMSNorm rather than LayerNorm. The normalization post explains why Pre-LN wins: it keeps gradients well-scaled through deep stacks without the warmup gymnastics Post-LN demands. The RMSNorm explainer covers why dropping the mean-centering and bias terms costs nothing measurable while saving a reduction and a parameter vector per layer. The last notable Post-LN holdouts predate this consensus: the original 2017 Transformer and GPT-1 normalized after the residual addition, but the GPT line switched early. GPT-2 already shipped with Pre-LN, and GPT-3 kept it there. Past that point the column is a single value all the way down.
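The difference between the two placements is only a matter of where the normalization call sits relative to the residual addition. A minimal sketch, assuming a generic `sublayer` callable standing in for attention or the FFN. The Post-LN variant uses RMSNorm here purely for contrast, whereas the original Post-LN models used LayerNorm.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root-mean-square, then apply a learned gain.

    Unlike LayerNorm there is no mean subtraction and no bias vector.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def pre_ln_block(x, sublayer, gain):
    """Pre-LN: normalize the sublayer input; the residual path stays untouched."""
    y = sublayer(rms_norm(x, gain))
    return [xi + yi for xi, yi in zip(x, y)]


def post_ln_block(x, sublayer, gain):
    """Post-LN, for contrast: add the sublayer output first, then normalize the sum."""
    summed = [xi + yi for xi, yi in zip(x, sublayer(x))]
    return rms_norm(summed, gain)


x = [3.0, -4.0, 0.0, 0.0]
gain = [1.0] * len(x)
zero_sublayer = lambda v: [0.0] * len(v)  # hypothetical sublayer that outputs zeros

normed = rms_norm(x, gain)                     # RMS of x is 2.5
passthrough = pre_ln_block(x, zero_sublayer, gain)
rescaled = post_ln_block(x, zero_sublayer, gain)
print(passthrough == x, rescaled == x)
```

With a sublayer that contributes nothing, the Pre-LN block returns its input unchanged, while the Post-LN block still rescales it. That unmodified identity path through the stack is what keeps gradients well-scaled in deep Pre-LN models.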

Positional encoding: RoPE everywhere

Learned absolute position embeddings fix the maximum sequence length at training time — there is simply no embedding for position 4097 if you trained at 4096. Sinusoidal encodings are defined at every position but generalize poorly past the training length in practice. ALiBi extends gracefully but biases attention additively, which interacts awkwardly with some of the cache-compression tricks above. Rotary position embedding (RoPE) rotates the query and key vectors by an angle proportional to position, so relative position falls out of the dot product multiplicatively; it extends to longer contexts through interpolation of the rotation frequencies and composes cleanly with GQA and MLA. Every model in the table uses RoPE except the two original GPT entries.
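The relative-position property can be verified numerically in a few lines. The sketch below rotates consecutive pairs of dimensions, which is one common layout (some implementations pair dimension `i` with `i + d/2` instead). The base of 10000 is the usual default and the vectors are arbitrary.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (query two positions after key), very different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)  # True

# A different offset gives a different score.
other = dot(rope(q, 5), rope(k, 4))
print(abs(near - other) > 1e-6)  # True
```

Because the score depends only on the offset between the two positions, nothing in the mechanism is tied to an absolute index. That is what leaves room for the frequency interpolation mentioned above.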

FFN: SwiGLU at the base, MoE at the top

The feed-forward nonlinearity has settled on SwiGLU, a gated linear unit with a SiLU gate, which replaced GeLU in essentially every post-2022 frontier model; the architecture internals post covers why the gating buys a consistent quality bump for a modest parameter increase. The more consequential split is whether that FFN is dense or sparse. Mixture-of-experts replaces the single FFN with many expert FFNs and a router that sends each token to a small subset. Mixtral routes each token to 2 of 8 experts; DeepSeek-V3 takes it further with fine-grained experts: 256 routed experts with the top-8 activated per token, plus one shared expert that every token always uses to absorb the common-case computation. The payoff is the decoupling that runs through the whole comparison: parameter count drives quality, active parameter count drives cost, and MoE lets you scale the first without the second. A 671B-parameter DeepSeek-V3 runs at roughly the per-token compute of a 37B dense model, although all 671B parameters still have to be held in memory.
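Routing itself is a small amount of code. The sketch below follows the Mixtral-style recipe of taking a softmax over the top-k router logits. It is an illustration under assumptions, not either model's implementation: the experts are toy scaling functions, the logits are made up, and DeepSeek-V3's shared expert and its different gating function are omitted.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over the chosen k
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, k):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Eight hypothetical experts; expert i just multiplies its input by i.
experts = [lambda x, s=i: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]  # made-up router scores for one token

routes = route_top_k(logits, k=2)
output = moe_forward([1.0, 2.0], logits, experts, k=2)
print([i for i, _ in routes])  # [1, 4]

# Active fraction for the two MoE rows with known totals in the table above.
print(f"Mixtral: {13 / 47:.0%}, DeepSeek-V3: {37 / 671:.1%}")
```

Only the selected experts are evaluated per token, which is where the gap between total and active parameter count comes from.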

Context length: the long-context race

The context column moved faster than any other over this period. GPT-3 and LLaMA 1 sat at 2048; LLaMA 2 doubled to 4096; Mistral 7B reached an effective 8192 through sliding-window attention; LLaMA 3 trained natively at 8192. Then the jump: Gemini 1.5 to a million tokens, DeepSeek-V2/V3 to 128k. That leap from a few thousand tokens to six and seven figures is not one trick but two working together. RoPE scaling methods (positional interpolation, NTK-aware scaling, YaRN) let a model trained at one length operate at a much longer one without retraining from scratch. And cache compression, MLA most of all, makes a 128k context fit in memory at serving time, which is the constraint that actually bites. Long context is as much a KV-cache story as a positional-encoding one.
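Of the scaling methods named above, positional interpolation is the simplest to show: positions are multiplied by the ratio of the training length to the target length, so every rotation angle stays inside the range seen during training. A minimal sketch, with the 4096 and 16384 lengths chosen only as an example. NTK-aware scaling and YaRN adjust the frequencies in more selective ways and are not shown.

```python
def rope_angles(pos, dim, base=10000.0, scale=1.0):
    """RoPE rotation angles for one position; scale < 1 compresses positions."""
    return [(pos * scale) * base ** (-i / dim) for i in range(0, dim, 2)]


train_len, target_len = 4096, 16384   # example lengths, not taken from any model
scale = train_len / target_len        # 0.25

trained_max = rope_angles(train_len - 1, dim=8)           # largest angles seen in training
unscaled = rope_angles(16380, dim=8)                      # far outside the trained range
interpolated = rope_angles(16380, dim=8, scale=scale)     # same as position 4095

print(max(unscaled) > max(trained_max))        # True
print(interpolated == rope_angles(4095, dim=8))  # True
```

The cost is resolution: neighbouring tokens now differ by a quarter of the angle they did in training, which is why these methods are typically paired with a short fine-tuning run at the longer length.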

The multimodal column the table doesn't show

The single biggest architectural fork the table can't capture is how vision and audio enter the model at all. Two designs dominate. Joint-sequence models tokenize every modality into the same stream: Gemini processes image, audio, and video patches as tokens interleaved with text, attended over by the same layers. Adapter models keep a frozen or lightly-tuned language model and bolt on a separate vision tower: GPT-4V is generally assumed to feed features from a vision encoder into the LLM (OpenAI has not published the design), and LLaMA 3's vision variants attach cross-attention adapters to the text backbone. The trade is predictable from the structure. Joint models reason across modalities more tightly because the cross-modal interaction happens at every layer; adapter models are far cheaper to train because the expensive language pretraining is reused untouched.

What's actually settled

By 2024 the frontier dense models share almost the same skeleton — Pre-LN, RMSNorm, RoPE, GQA, SwiGLU — to the point where you could describe LLaMA-3, Qwen2.5, and the dense layers of Mistral in one sentence and be right about all three. What separates them is no longer the diagram. It is scale, the quality and quantity of the training data, the compute budget spent, and, for the labs chasing efficiency, two specific levers: how the mixture-of-experts routes its tokens and how the attention compresses its cache. The architecture wars produced a consensus. The engineering wars over data, routing, and memory are the ones still being fought.

Cite this work

Roy, Swastik. (2024). Comparing Large Model Architectures: Attention, Normalization, and Scale. S. Roy. https://swastikroy.me/blog/large-model-architecture-comparison
