Blog Post
The Lifecycle of a KV Cache: From Prefill to Last Token
A request-level walkthrough of how the KV cache is populated, grown, and read during LLM inference — covering prefill, decode, memory layout, and why decode is memory-bandwidth-bound.
Views: –5 min readCite
Autoregressive generation produces one token per forward pass, and every pass needs to attend over the entire context produced so far. Done naively, that means recomputing the keys and values for all previous positions on every step — an cost per token, to finish a sequence. The keys and values for a position never change once that position is in the context, so there is no reason to recompute them. The KV cache stores them after the first time they are computed and reads them back on every subsequent step. That single decision splits inference into two phases with completely different performance characters.
Prefill: one big parallel pass
Prefill processes the prompt. Given a prompt of length , you run a single forward pass over all positions at once. At each layer , the key and value projections are dense matrix multiplications against the layer's hidden states :
Both projections cover all positions in one shot, which is why prefill saturates the GPU's matrix units instead of starving them. The resulting and are written into the cache with shape [P, n_heads, head_dim] per layer, and from this point on those numbers are never recomputed. Prefill is compute-bound: you are paying for attention FLOPs and projection FLOPs, and the hardware runs near its roofline because the work is one large batched matmul — and FlashAttention keeps that score matrix in on-chip SRAM rather than round-tripping it through HBM, which is what holds prefill's memory cost linear in instead of quadratic.
Decode: tiny matmul, enormous read
Decode generates the continuation, one token at a time. For new token you compute for that single position only — each of shape [1, n_heads, head_dim] — append the new and to the cache (which now holds positions), and attend the new query against the entire cache:
The query is a single row, so the matmuls are minuscule, but and span every token seen so far and must be streamed out of HBM in full. Decode is therefore memory-bandwidth-bound: the step is dominated not by arithmetic but by the time it takes to move the cache through the memory system.
What the cache actually weighs
The cache is two tensors per layer, and . Take LLaMA-2-70B: 80 layers, 8 key/value heads under grouped-query attention, head dimension 128, stored in fp16. The per-token footprint is
The leading factor of 2 is for and ; the trailing factor of 2 is bytes per fp16 element. At a context of 4096 tokens that is roughly for a single sequence. Long context is expensive for exactly this reason — the cost is linear in sequence length and you pay it again for every concurrent request in the batch.
Why decode lives below the roofline
Consider the arithmetic intensity of a decode step: useful FLOPs divided by bytes moved. Attending one query over a 4096-token cache is on the order of FLOPs, while reading the cache moves bytes. That is an arithmetic intensity of roughly
An A100's ridge point — the intensity at which compute and bandwidth are balanced — sits near 156 FLOPs/byte (312 TFLOP/s fp16 over 2.0 TB/s of HBM bandwidth). At 0.031 FLOPs/byte you are far to the left of that ridge, squarely bandwidth-bound: the matrix units idle while the memory system does all the work.
The prefill/decode gap, and why batching rescues throughput
Prefill is a large parallel matmul with high utilization; decode is a sequence of tiny matmuls each gated on a full cache read, so per-step utilization is dismal. The fix is batching. Decode steps from different sequences read their own caches but share the same weight matrices, and stacking queries turns the per-step projection into a [B, d] × [d, d] matmul — real work for the compute units while the bandwidth cost of the weights is amortized across the batch. This is why serving throughput climbs steeply with batch size up to the point where the combined KV caches exhaust GPU memory. The cache footprint, not the compute, is usually what caps the batch.
When the cache fills
Memory is finite, so a sequence cannot grow its cache forever. The simplest response is a hard maximum sequence length: once the cache reaches it, generation stops. Beyond that, eviction policies trade fidelity for headroom. A sliding window keeps only the most recent tokens' keys and values and drops the rest, bounding the cache at regardless of how long the sequence runs — at the cost of forgetting anything older than the window. Attention-sink schemes keep the first few tokens plus a recent window, exploiting the empirical fact that early positions absorb a disproportionate share of attention mass. More aggressive eviction scores cached entries by accumulated attention weight and discards the least-attended, approximating full attention at a fraction of the memory. Every one of these is the same bet: the cache is the binding constraint, so spend accuracy to buy back bandwidth and space.