Cheatsheet: Attention
Every equation in scaled dot-product attention and multi-head attention annotated term-by-term — the scaling, the softmax, the heads, RoPE, and KV cache — with links to the posts explaining each design choice.
Self-attention is the core operation of every transformer. This cheatsheet walks the equations from raw input tokens to multi-head output, annotating every design decision and linking to deeper explanations.
Scaled dot-product attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Term-by-term
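$Q$, $K$, $V$: queries, keys, values
Given an input matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ (sequence length $n$, model dimension $d_{\text{model}}$):
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$
where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.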
Why three separate projections? The query, key, and value serve different roles. The query encodes what this token is looking for; the key encodes what this token offers to be found by; the value encodes what information to pass on if found. Keeping them separate allows the model to learn asymmetric matching relationships. See QKOVCircuitDemo.
$QK^\top$: attention scores
Entry $(i, j)$ is the dot product between the query for token $i$ and the key for token $j$, a measure of how relevant token $j$ is to token $i$.
Complexity: Computing all $n^2$ pairs requires $O(n^2 d_k)$ time and $O(n^2)$ memory. This is the quadratic bottleneck that motivates FlashAttention and sparse attention variants. See FlashAttentionMemory.
$\sqrt{d_k}$: the scaling factor
Why divide by $\sqrt{d_k}$? The dot product is $q \cdot k = \sum_{l=1}^{d_k} q_l k_l$. If the components $q_l$ and $k_l$ are independent with zero mean and unit variance, each term contributes variance 1, so the sum has variance $d_k$. Dividing by $\sqrt{d_k}$ restores unit variance, keeping the scores in a range where the softmax output remains spread out rather than saturating toward a one-hot distribution. See SinusoidalHeatmap.
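The variance argument above can be checked numerically. The following is a minimal sketch, assuming standard-normal query and key components; the function name `dot_product_variance` and the sample size are illustrative choices, not part of the original post.

```python
import numpy as np

def dot_product_variance(d_k, n_samples=20000, seed=0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    # Assumption: components are i.i.d. with zero mean and unit variance.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)        # one dot product per sample
    scaled = raw / np.sqrt(d_k)        # the attention scaling
    return raw.var(), scaled.var()

raw_var, scaled_var = dot_product_variance(d_k=64)
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

Under these assumptions the raw variance should land close to `d_k` and the scaled variance close to 1.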
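$\mathrm{softmax}(\cdot)$: attention weights
The softmax is applied row-wise (each query independently), producing a probability distribution over all key positions. Writing $A = \mathrm{softmax}(QK^\top / \sqrt{d_k})$, entry $A_{ij}$ is the weight token $i$ assigns to token $j$.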
Why softmax rather than a hard argmax? Softmax is differentiable — gradients flow through all positions, not just the maximum. It also allows the attention to blend information from multiple tokens simultaneously, which a hard selection could not do.
Temperature and sharpness. Dividing the logits by a temperature $T$ before softmax controls sharpness: $T < 1$ sharpens (more selective), $T > 1$ flattens (more diffuse). The $\sqrt{d_k}$ scaling is effectively setting $T = \sqrt{d_k}$. See SoftmaxTemperature.
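A small sketch of the temperature effect, using made-up logits. The `softmax` helper here is written for illustration and subtracts the row maximum for numerical stability.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis after dividing logits by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # stability: largest exponent is 0
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.0])          # illustrative values
sharp = softmax(logits, temperature=0.5)    # T < 1: more selective
base = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=4.0)     # T > 1: more diffuse
print(sharp.round(3), base.round(3), flat.round(3))
```

The largest weight shrinks as the temperature grows, while each output still sums to 1.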
Causal masking (decoder-only)
For autoregressive models, token $i$ must not attend to future positions $j > i$:
$$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right), \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$
Adding $-\infty$ to future positions before softmax maps them to $e^{-\infty} = 0$, making their contribution exactly zero.
Why add $-\infty$ rather than zero-out the weights post-softmax? Post-softmax zeroing would break the normalisation: the remaining weights would no longer sum to 1. Adding $-\infty$ pre-softmax lets the softmax normalise correctly over only the allowed positions.
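The difference between the two masking orders can be seen on a toy score matrix. This is a sketch with random scores; `causal_mask` and `softmax_rows` are hypothetical helper names introduced only for this example.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 where j <= i, -inf where j > i (future positions)."""
    future = np.arange(n)[None, :] > np.arange(n)[:, None]
    return np.where(future, -np.inf, 0.0)

def softmax_rows(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                      # exp(-inf) evaluates to exactly 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))        # stand-in for QK^T / sqrt(d_k)

masked = softmax_rows(scores + causal_mask(4))            # mask before softmax
zeroed = softmax_rows(scores) * (causal_mask(4) == 0)     # zero after softmax

print(masked.sum(axis=1))   # every row sums to 1
print(zeroed.sum(axis=1))   # early rows sum to less than 1
```

Only the pre-softmax version yields a valid distribution for every row; the post-softmax version would need an extra renormalisation step.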
$\mathrm{softmax}(\cdot)\,V$: value aggregation
The output for token $i$ is the weighted sum of all value vectors, $o_i = \sum_j A_{ij} v_j$, where $A_{ij}$ is the attention weight token $i$ assigns to token $j$. Each output dimension is an independent linear combination: the attention weights select which tokens to aggregate, but the value projection determines what information each token contributes.
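Putting the terms above together, a single attention head can be sketched in a few lines of NumPy. This is an illustrative implementation under assumed shapes (random weights, `d_k = d_v`), not a reference one; the names `W_q`, `W_k`, `W_v` are chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head attention over an (n, d_model) input. Returns (output, weights)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # three separate projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) scaled scores
    if causal:
        n = scores.shape[0]
        future = np.arange(n)[None, :] > np.arange(n)[:, None]
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                  # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4                        # toy sizes, chosen arbitrarily
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, A = scaled_dot_product_attention(X, W_q, W_k, W_v, causal=True)
print(out.shape, A.sum(axis=1))
```

With `causal=True`, the first token can only attend to itself, so its output is exactly its own value vector.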
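Multi-head attention
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_i^Q,\ XW_i^K,\ XW_i^V)$$
Each head $i$ has its own projection matrices $W_i^Q$, $W_i^K$, $W_i^V$, where $d_k = d_v = d_{\text{model}} / h$. The output projection $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ maps the concatenated heads back to the model dimension.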
Why multiple heads? A single attention operation produces one weighted average — it can attend strongly to one relationship at a time. Multiple heads can simultaneously attend to syntactic structure, semantic similarity, positional proximity, and coreference in different representation subspaces. The concatenation then fuses all these views. See AttentionHeadDiagram.
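A compact sketch of multi-head attention follows. It assumes the common layout in which the per-head projections are stacked into single `d_model x d_model` matrices and split by reshaping; biases and masking are omitted, and all sizes are toy values.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention over an (n, d_model) input, no mask or bias."""
    n, d_model = X.shape
    d_k = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_k): each head gets its own subspace
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (n_heads, n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    heads = A @ V                                          # (n_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # fuse the views

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
X = rng.standard_normal((n, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
n_params = sum(W.size for W in (W_q, W_k, W_v, W_o))
print(out.shape, n_params)
```

In this layout the four weight matrices together hold `4 * d_model**2` parameters, independent of the number of heads.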
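Total parameter count of multi-head attention (ignoring biases), using $h\,d_k = d_{\text{model}}$:
$$\underbrace{3\,h\,d_{\text{model}}\,d_k}_{W_i^Q,\ W_i^K,\ W_i^V} + \underbrace{d_{\text{model}}^2}_{W^O} = 4\,d_{\text{model}}^2$$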
Rotary positional encoding (RoPE)
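Instead of adding a positional vector to the embedding, RoPE rotates the query and key vectors by an angle proportional to their absolute position before computing dot products:
$$\tilde{q}_i = R_i\, q_i, \qquad \tilde{k}_j = R_j\, k_j$$
where $R_i$ is a block-diagonal rotation matrix that rotates the $l$-th pair of coordinates by the angle $i\,\theta_l$. Because $R_i^\top R_j = R_{j-i}$, the dot product $\tilde{q}_i^\top \tilde{k}_j = q_i^\top R_{j-i}\, k_j$ depends only on the relative position $j - i$, not on absolute positions. The model sees positional information as relative distance, not as absolute indices.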
Why relative rather than absolute positions? Language is largely positional relative to context, not absolute. "The dog that bit the man" — the relationship between "dog" and "bit" matters, not which absolute token positions they occupy. Relative encoding also generalises better to lengths unseen during training. See RoPERotationDemo and PEEvolutionComparison.
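The relative-position property can be verified directly. The sketch below rotates consecutive coordinate pairs and assumes the commonly used frequency schedule with base 10000; that base, the pairing of adjacent coordinates, and the function name `rope` are assumptions for illustration.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate each consecutive coordinate pair of x by position * theta_l."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D pair
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # standard 2D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

score_a = rope(q, 3) @ rope(k, 7)        # positions (3, 7): offset 4
score_b = rope(q, 103) @ rope(k, 107)    # positions (103, 107): offset 4
score_c = rope(q, 3) @ rope(k, 8)        # positions (3, 8): offset 5
print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error because they share the same offset, while changing the offset changes the score.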
KV cache
During autoregressive decoding, the keys and values for all past tokens are constant — they do not change as new tokens are generated. Caching them avoids recomputation:
| Without cache | With cache |
|---|---|
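| Recompute $K$, $V$ for all $n$ tokens at each step | Compute the $K$, $V$ rows for the new token only; concatenate with cache |
| $O(n^2 d_{\text{model}})$ attention cost per generation step | $O(n\,d_{\text{model}})$ attention cost per generation step |
| $O(1)$ extra memory (activations not stored) | $O(n\,d_{\text{model}})$ memory per layer |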
What exactly is cached? The projected key matrix $K$ and value matrix $V$ for every layer and every head. The query is recomputed fresh for each new token since it changes at every step. See KVCacheCalculator and KVCacheMemoryGrowth.
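A minimal single-head, single-layer sketch of decoding with a KV cache, compared against recomputing keys and values from scratch at every step. The token embeddings are random stand-ins, and `attend` and `recompute_step` are hypothetical helpers for this example.

```python
import numpy as np

def attend(q, K, V):
    """Attention output for one query vector over all available keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    scores = scores - scores.max()
    w = np.exp(scores)
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
d_model, d_k, steps = 8, 4, 5
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
X = rng.standard_normal((steps, d_model))    # one embedding per decoding step

K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_outputs = []
for t in range(steps):
    x_t = X[t]
    # Only the new token is projected; earlier rows are reused from the cache.
    K_cache = np.vstack([K_cache, x_t @ W_k])
    V_cache = np.vstack([V_cache, x_t @ W_v])
    cached_outputs.append(attend(x_t @ W_q, K_cache, V_cache))
cached_outputs = np.array(cached_outputs)

def recompute_step(t):
    # Without a cache: re-project every token seen so far at each step.
    K, V = X[: t + 1] @ W_k, X[: t + 1] @ W_v
    return attend(X[t] @ W_q, K, V)

uncached_outputs = np.array([recompute_step(t) for t in range(steps)])
print(np.allclose(cached_outputs, uncached_outputs))
```

Both paths produce the same outputs; the cached path simply avoids re-projecting past tokens, at the price of storing one key row and one value row per token.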
Grouped-query attention (GQA)
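Standard multi-head attention has one KV head per query head. GQA reduces this to $g$ KV groups shared across $h/g$ query heads each:
$$\mathrm{head}_i = \mathrm{Attention}\!\left(XW_i^Q,\ XW_{c(i)}^K,\ XW_{c(i)}^V\right), \qquad c(i) = \lceil i\,g / h \rceil, \quad i = 1, \dots, h$$
Why? KV cache memory scales with the number of KV heads. With $h$ heads and $g$ groups, cache memory drops by a factor of $h/g$ with minimal quality loss. See GQAHeadComparison.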
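To make the saving concrete, here is a back-of-the-envelope cache-size calculation. The configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per value) is a hypothetical example, not taken from any specific model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, chosen only for illustration.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(mha / 2**20, "MiB with 32 KV heads")
print(gqa / 2**20, "MiB with 8 KV heads")
print(mha // gqa, "x smaller")
```

Going from 32 KV heads to 8 shrinks the cache by the ratio of the two head counts, here a factor of 4.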
Complexity summary
| Quantity | Standard attention |
|---|---|
| Time | $O(n^2 d_{\text{model}})$ |
| Memory (attention matrix) | $O(n^2)$ per head |
| KV cache per layer | $O(n\,d_{\text{model}})$ |
| Parameters | $4\,d_{\text{model}}^2$ |