Blog Post · 8 min read
PagedAttention: Virtual Memory for the KV Cache
Contiguous KV cache allocation wastes GPU memory through fragmentation and over-reservation. PagedAttention fixes this by treating the KV cache as paged virtual memory: small fixed-size blocks are allocated on demand, freed as soon as a sequence finishes, and reused without copying.
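To make the paging idea concrete, here is a minimal sketch of the bookkeeping involved. It is an illustration under stated assumptions, not the implementation of any particular library: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, the block size of 16 tokens is an arbitrary choice, and the code tracks only block ids on the CPU rather than real GPU tensors.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block; an assumed value for illustration


class BlockAllocator:
    """Hands out physical block ids from a fixed-size pool (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        # Free list of physical block ids; lowest ids are handed out first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        # A freed block is immediately available to any other sequence.
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def release(self, allocator: BlockAllocator) -> None:
        # Return every block to the pool; no KV data is moved or copied.
        for block_id in self.block_table:
            allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=4)
seq = Sequence()
for _ in range(17):  # 17 tokens span two 16-token blocks
    seq.append_token(allocator)
blocks_in_use = list(seq.block_table)
seq.release(allocator)
```

In this sketch a sequence never reserves more than one partially filled block, so the over-reservation is bounded by `BLOCK_SIZE - 1` tokens per sequence instead of the full maximum sequence length.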