S. Roy

Blog Post

Cheatsheet: Attention

Every equation in scaled dot-product attention and multi-head attention annotated term-by-term — the scaling, the softmax, the heads, RoPE, and KV cache — with links to the posts explaining each design choice.


Self-attention is the core operation of every transformer. This cheatsheet walks the equations from raw input tokens to multi-head output, annotating every design decision and linking to deeper explanations.


Scaled dot-product attention

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Term-by-term

$Q$, $K$, $V$: queries, keys, values

Given an input matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ (sequence length $n$, model dimension $d_{\text{model}}$):

$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.

Why three separate projections? The query, key, and value serve different roles. The query encodes what this token is looking for; the key encodes what this token offers to be found by; the value encodes what information to pass on if found. Keeping them separate allows the model to learn asymmetric matching relationships. See QKOVCircuitDemo.
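
As a rough illustration of the three projections, here is a minimal NumPy sketch. The sizes and the random matrices are assumptions chosen for the example; in a trained model the weights are learned. It also shows that the score matrix $QK^\top$ is generally not symmetric, which is the asymmetric matching that separate query and key projections allow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not taken from the post).
n, d_model, d_k, d_v = 4, 8, 3, 5

X = rng.normal(size=(n, d_model))        # one row per token
W_Q = rng.normal(size=(d_model, d_k))    # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
S = Q @ K.T                              # raw attention scores, shape (n, n)

print(Q.shape, K.shape, V.shape)         # (4, 3) (4, 3) (4, 5)
print(np.allclose(S, S.T))               # False: S[i, j] != S[j, i] in general
```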


$QK^\top$: attention scores

$$S = QK^\top \in \mathbb{R}^{n \times n}$$

Entry $S_{ij}$ is the dot product between the query for token $i$ and the key for token $j$, a measure of how relevant token $j$ is to token $i$.

Complexity: Computing all $n^2$ pairs requires $O(n^2 d_k)$ time and $O(n^2)$ memory. This is the quadratic bottleneck that motivates FlashAttention and sparse attention variants. See FlashAttentionMemory.


$\frac{1}{\sqrt{d_k}}$: the scaling factor

$$\text{Scaled scores} = \frac{QK^\top}{\sqrt{d_k}}$$

Why divide by $\sqrt{d_k}$? The dot product is $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$. If $q_i$ and $k_i$ are independent with zero mean and unit variance, each term contributes variance 1, so the sum has variance $d_k$. Dividing by $\sqrt{d_k}$ restores unit variance, keeping the scores in a range where the softmax remains spread out rather than saturating toward a one-hot distribution. See SinusoidalHeatmap.
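
The variance argument can be checked empirically. The sketch below assumes standard-normal components (one distribution that satisfies the zero-mean, unit-variance condition) and an illustrative $d_k = 64$; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64              # illustrative head dimension
num_pairs = 20_000    # number of random (q, k) pairs to sample

q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

dots = np.sum(q * k, axis=1)                 # one dot product per pair
raw_var = dots.var()                         # expected to be close to d_k
scaled_var = (dots / np.sqrt(d_k)).var()     # expected to be close to 1

print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```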


$\text{softmax}(\cdot)$: attention weights

$$\alpha_{ij} = \frac{\exp(S_{ij} / \sqrt{d_k})}{\sum_{k=1}^{n} \exp(S_{ik} / \sqrt{d_k})}$$

The softmax is applied row-wise (each query independently), producing a probability distribution over all $n$ key positions. $\alpha_{ij}$ is the weight token $i$ assigns to token $j$.

Why softmax rather than a hard argmax? Softmax is differentiable — gradients flow through all positions, not just the maximum. It also allows the attention to blend information from multiple tokens simultaneously, which a hard selection could not do.

Temperature and sharpness. Dividing the logits by a temperature $\tau$ before softmax controls sharpness: $\tau < 1$ sharpens (more selective), $\tau > 1$ flattens (more diffuse). The $\sqrt{d_k}$ scaling effectively sets $\tau = \sqrt{d_k}$. See SoftmaxTemperature.
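
A small sketch of the temperature effect, assuming a hand-picked logit vector. The `softmax` helper is written here for illustration and subtracts the row maximum for numerical stability, which does not change the result.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Softmax over the last axis after dividing the logits by tau."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)   # stability shift; result unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.0])          # illustrative values

sharp = softmax(logits, tau=0.5)            # tau < 1: more peaked
plain = softmax(logits, tau=1.0)
flat = softmax(logits, tau=4.0)             # tau > 1: closer to uniform

for name, p in (("tau=0.5", sharp), ("tau=1.0", plain), ("tau=4.0", flat)):
    print(name, np.round(p, 3))
```

The largest weight shrinks as $\tau$ grows, while each distribution still sums to 1.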


Causal masking (decoder-only)

For autoregressive models, token $i$ must not attend to future positions $j > i$:

$$S_{ij} \leftarrow S_{ij} + M_{ij}, \quad M_{ij} = \begin{cases} 0 & j \leq i \\ -\infty & j > i \end{cases}$$

Adding $-\infty$ to future positions before softmax maps them to $\exp(-\infty) = 0$, making their contribution exactly zero.

Why add $-\infty$ rather than zero out the weights post-softmax? Post-softmax zeroing would break the normalisation: the remaining weights would no longer sum to 1. Adding $-\infty$ pre-softmax lets the softmax normalise correctly over only the allowed positions.
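
The difference between the two masking strategies can be seen on a small random score matrix. This is a sketch with assumed sizes; `softmax_rows` is a helper defined only for the example.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax; exp(-inf) evaluates to 0, so masked entries vanish."""
    S = S - S.max(axis=-1, keepdims=True)
    e = np.exp(S)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
rng = np.random.default_rng(0)
S = rng.normal(size=(n, n))                      # stand-in for scaled scores

allowed = np.tril(np.ones((n, n), dtype=bool))   # True where j <= i
M = np.where(allowed, 0.0, -np.inf)              # additive causal mask

alpha_masked = softmax_rows(S + M)               # mask before softmax
alpha_zeroed = softmax_rows(S) * allowed         # zero out after softmax

print(alpha_masked.sum(axis=1))   # every row sums to 1
print(alpha_zeroed.sum(axis=1))   # rows with masked entries sum to less than 1
```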


$V$: value aggregation

$$\text{Output} = \alpha V = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

The output for token $i$ is the weighted sum of all value vectors, $\text{Output}_i = \sum_j \alpha_{ij} V_j$, with weights given by row $i$ of the attention distribution. Every output dimension uses the same weights: the attention weights select which tokens to aggregate, while the value projection determines what information each token contributes.
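
Putting the pieces together, a single attention head can be sketched in a few lines of NumPy. The function name, the `causal` flag, and the test sizes are choices made for this example, and the causal branch assumes queries and keys cover the same positions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Return (output, weights) for one head. Q: (n, d_k), K: (n, d_k), V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) scaled scores
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # j > i
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)          # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 5))

out, w = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)                    # (4, 5)
print(np.allclose(out[0], V[0]))    # True: token 0 can only attend to itself
```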


Multi-head attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

$$\text{head}_j = \text{Attention}(Q W^Q_j,\, K W^K_j,\, V W^V_j)$$

Each head $j$ has its own projection matrices $W^Q_j \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_j$, $W^V_j$, where $d_k = d_{\text{model}} / h$.

Why multiple heads? A single attention operation produces one weighted average — it can attend strongly to one relationship at a time. Multiple heads can simultaneously attend to syntactic structure, semantic similarity, positional proximity, and coreference in different representation subspaces. The concatenation then fuses all these views. See AttentionHeadDiagram.

Total parameter count of multi-head attention:

$$4 \cdot d_{\text{model}}^2 \quad \text{(} W^Q, W^K, W^V \text{ each } d_{\text{model}} \times d_{\text{model}} \text{, and } W^O \text{)}$$
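
A compact sketch of multi-head self-attention and the parameter count. It assumes the per-head matrices $W^Q_j$, $W^K_j$, $W^V_j$ are stored as column blocks of single $d_{\text{model}} \times d_{\text{model}}$ matrices (a common layout, but an assumption here), uses $d_v = d_k$, and ignores biases. Sizes and random weights are illustrative.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    e = np.exp(S)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Self-attention over X with h heads; all weight matrices are (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_k): column block j belongs to head j
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n)
    heads = softmax_rows(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1..head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(n, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
num_params = sum(W.size for W in (W_Q, W_K, W_V, W_O))
print(out.shape, num_params)   # (5, 16) 1024
```

Here $4 \cdot 16^2 = 1024$, matching the formula above.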

Rotary positional encoding (RoPE)

Instead of adding a positional vector to the embedding, RoPE rotates the query and key vectors by an angle proportional to their absolute position before computing dot products:

$$\tilde{q}_m = R_m q_m, \qquad \tilde{k}_n = R_n k_n$$

$$\tilde{q}_m^\top \tilde{k}_n = q_m^\top R_m^\top R_n k_n = q_m^\top R_{n-m} k_n$$

The dot product depends only on the relative position $n - m$, not on absolute positions: the model sees positional information as relative distance, not as absolute indices.

Why relative rather than absolute positions? In language, what matters is mostly where a token sits relative to its context, not its absolute index. In "The dog that bit the man", the relationship between "dog" and "bit" matters, not which absolute token positions they occupy. Relative encoding also tends to generalise better to sequence lengths unseen during training. See RoPERotationDemo and PEEvolutionComparison.
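
The relative-position property can be verified numerically. The sketch below assumes the usual RoPE construction, which the post does not spell out: consecutive coordinate pairs are rotated by angle $\text{pos} \cdot \theta_i$ with $\theta_i = 10000^{-2i/d}$. The function name `rope` and the base value are choices for this example.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of a 1-D vector by angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

a = rope(q, 3) @ rope(k, 7)        # positions (m, n) = (3, 7), offset 4
b = rope(q, 103) @ rope(k, 107)    # both shifted by 100, same offset 4

print(np.isclose(a, b))            # True: only n - m matters
```

Because each $R_m$ is a rotation, it also leaves the vector norm unchanged.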


KV cache

During autoregressive decoding, the keys and values for all past tokens are constant — they do not change as new tokens are generated. Caching them avoids recomputation:

| Without cache | With cache |
| --- | --- |
| Recompute $K, V$ for all $n$ tokens at each step | Compute $K, V$ for the new token only; concatenate with cache |
| $O(n^2)$ per generation step | $O(n)$ per generation step |
| $O(1)$ memory (activations not stored) | $O(n \cdot h \cdot d_k)$ memory per layer |

What exactly is cached? The projected key matrix $K = X W^K$ and value matrix $V = X W^V$ for every layer and every head. Queries are not cached: each step needs only the query of the newest token, which is computed fresh, and the queries of earlier tokens are never reused. See KVCacheCalculator and KVCacheMemoryGrowth.
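
A minimal single-layer, single-head sketch of decoding with and without a cache, to show that both give the same outputs. The sizes, random weights, and the `attend` helper are assumptions for the example; a real model keeps one cache per layer and per KV head.

```python
import numpy as np

def attend(q, K, V):
    """Attention output for one query vector q over keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4
X = rng.normal(size=(n, d_model))                 # token inputs, one row per step
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# With cache: project only the newest token and append it to the cache.
K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_out = []
for t in range(n):
    x_t = X[t]
    K_cache = np.vstack([K_cache, x_t @ W_K])
    V_cache = np.vstack([V_cache, x_t @ W_V])
    cached_out.append(attend(x_t @ W_Q, K_cache, V_cache))
cached_out = np.array(cached_out)

# Without cache: recompute K and V for the whole prefix at every step.
full_out = np.array([
    attend(X[t] @ W_Q, X[: t + 1] @ W_K, X[: t + 1] @ W_V) for t in range(n)
])

print(np.allclose(cached_out, full_out))   # True
```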


Grouped-query attention (GQA)

Standard multi-head attention has one KV head per query head. GQA reduces this to $g$ KV heads, each shared by a group of $h/g$ query heads:

$$\text{GQA}: \quad h \text{ query heads},\quad g \text{ KV heads},\quad g < h$$

Why? KV cache memory scales with the number of KV heads. With $h = 32$ heads and $g = 8$ groups, cache memory drops by $4\times$, typically with little quality loss. See GQAHeadComparison.
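
The $4\times$ figure follows from simple arithmetic. The helper below is a sketch; the sequence length, head dimension, layer count, and 2 bytes per value (16-bit storage) are hypothetical numbers picked for illustration, not taken from any particular model.

```python
def kv_cache_bytes(n_tokens, n_kv_heads, d_k, n_layers, bytes_per_value=2):
    """Bytes needed to cache keys and values: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_tokens * n_kv_heads * d_k * bytes_per_value

# Hypothetical configuration.
n_tokens, d_k, n_layers = 4096, 128, 32

mha = kv_cache_bytes(n_tokens, n_kv_heads=32, d_k=d_k, n_layers=n_layers)  # h = 32
gqa = kv_cache_bytes(n_tokens, n_kv_heads=8, d_k=d_k, n_layers=n_layers)   # g = 8

print(mha / 2**30, gqa / 2**30, mha / gqa)   # 2.0 0.5 4.0  (GiB, GiB, ratio)
```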


Complexity summary

| Quantity | Standard attention |
| --- | --- |
| Time | $O(n^2 d)$ |
| Memory (attention matrix) | $O(n^2)$ |
| KV cache per layer | $O(n \cdot 2 \cdot h \cdot d_k)$ |
| Parameters | $4 d_{\text{model}}^2$ |

Cite this work


Roy, Swastik. (2025). Cheatsheet: Attention. S. Roy. https://swastikroy.me/blog/cheatsheet-attention
