
Blog Post · 2024-06-19 · 4 min read

KV Cache Memory: Quantization, Eviction, and the Long-Context Problem

The KV cache is often the main memory bottleneck in LLM inference. It grows linearly with context length and batch size, so at long contexts it can take more GPU memory than the model weights. Here's how quantization, eviction policies, and architectural changes keep it manageable.
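To make the scaling concrete, here is a rough back-of-the-envelope sketch of KV cache size under standard attention. The function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not figures from this post, and the estimate ignores quantization metadata such as scales and zero-points.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_elem=16):
    """Approximate KV cache size in bytes (hypothetical helper).

    One key vector and one value vector are stored per token,
    per layer, per KV head, hence the leading factor of 2.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elems * bits_per_elem // 8


GIB = 2 ** 30

# Assumed model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    fp16 = kv_cache_bytes(seq_len=seq_len, bits_per_elem=16, **shape)
    int4 = kv_cache_bytes(seq_len=seq_len, bits_per_elem=4, **shape)
    print(f"{seq_len:>7} tokens: fp16 {fp16 / GIB:6.1f} GiB, "
          f"int4 {int4 / GIB:6.1f} GiB")
```

With this assumed shape, each token costs 0.5 MiB in fp16, so a single 4,096-token sequence already needs 2 GiB. Quantizing to fewer bits per element shrinks the cache proportionally, while fewer KV heads (an architectural change) or fewer retained tokens (eviction) reduce the other factors in the product.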

inference kv-cache quantization memory long-context