Tag: paged-attention

Blog Post · 2024-06-19 · 8 min read

PagedAttention: Virtual Memory for the KV Cache

Contiguous KV cache allocation wastes GPU memory through fragmentation and over-reservation. PagedAttention fixes this by treating the KV cache as paged virtual memory: small fixed-size blocks are allocated on demand, freed as soon as they are no longer needed, and reused without copying.
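To make the paging idea concrete, here is a minimal sketch of a block allocator in plain Python. It is an illustration only, not vLLM's actual implementation or API: the class `BlockAllocator`, its method names, and the default block size of 16 tokens are all assumptions chosen for the example. Each sequence keeps a block table mapping its logical blocks to physical block ids, a new block is taken from the free list only when the previous one is full, and freeing a sequence returns its blocks to the pool without moving any data.

```python
class BlockAllocator:
    """Toy KV-cache block allocator (hypothetical, for illustration only)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate on demand: only when the sequence has no block yet
        # or its last block is completely full.
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return all of a sequence's blocks to the pool; nothing is copied."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Usage: with 2-token blocks, 3 tokens occupy 2 blocks (one slot unused).
alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")
print(len(alloc.block_tables["a"]), len(alloc.free_blocks))
alloc.free("a")
print(len(alloc.free_blocks))
```

In this sketch the only wasted space is the unused tail of a sequence's last block, which is why small fixed-size blocks keep over-reservation low compared with reserving one contiguous region per sequence.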

inference vllm paged-attention kv-cache systems