An interactive research blog. Drag the config of a decoder-only transformer — hidden size, head counts, FFN type — and watch the parameter count, KV cache, and mixture-of-experts routing recompute live.
The KV cache is often the main memory bottleneck in LLM inference. It grows linearly with context length, so at long contexts it can come to dominate GPU memory. Here's how quantization, eviction policies, and architectural changes keep it in check.
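To make the scaling concrete, here is a rough sizing sketch. It assumes standard attention that stores one key and one value vector per layer, per KV head, per token. The model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical config chosen for illustration, not one taken from the posts.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 covers the key tensor plus the value tensor.
    bytes_per_elem=2 corresponds to fp16/bf16; 1 would model int8 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
# That is 0.5 MiB of cache per token.
for seq_len in (4_096, 32_768, 131_072):
    full = kv_cache_bytes(32, 32, 128, seq_len)
    # Architectural change: grouped-query attention with 8 KV heads.
    gqa = kv_cache_bytes(32, 8, 128, seq_len)
    # Quantization: same layout, 1 byte per element.
    int8 = kv_cache_bytes(32, 32, 128, seq_len, bytes_per_elem=1)
    print(f"{seq_len:>7} tokens: "
          f"fp16 {full / GIB:.1f} GiB, "
          f"GQA-8 {gqa / GIB:.1f} GiB, "
          f"int8 {int8 / GIB:.1f} GiB")
```

Every term in the product is a lever: fewer KV heads (an architectural change), fewer bytes per element (quantization), or fewer retained tokens (eviction).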
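LLM inference has two fundamentally different compute phases. Prefill processes the whole prompt in parallel and is typically compute-bound. Decode generates tokens one at a time and is typically memory-bandwidth-bound, because each step rereads the weights and the KV cache to produce a single token. Knowing which phase dominates your workload determines what you optimize.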
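A back-of-the-envelope roofline check shows why the two phases differ. This is a minimal sketch for a single weight matrix; it ignores attention, caching effects, and kernel fusion. The hardware figures (300 TFLOP/s peak, 2 TB/s memory bandwidth) are made-up round numbers for a hypothetical accelerator, not the specs of any real device.

```python
def matmul_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs per byte) of x @ W for n_tokens rows.

    Counts 2 FLOPs per multiply-accumulate, and memory traffic for the
    weights, the input activations, and the output activations.
    """
    flops = 2 * n_tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (
        d_in * d_out + n_tokens * d_in + n_tokens * d_out
    )
    return flops / bytes_moved


def bound(intensity, peak_flops, mem_bandwidth):
    """Classify a kernel against the roofline ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte at the ridge
    if intensity >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"


# Hypothetical accelerator: 300 TFLOP/s, 2 TB/s, so the ridge is 150 FLOPs/byte.
PEAK_FLOPS = 300e12
MEM_BANDWIDTH = 2e12

prefill = matmul_intensity(n_tokens=2048, d_in=4096, d_out=4096)
decode = matmul_intensity(n_tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill:.0f} FLOPs/byte,",
      bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
print(f"decode:  {decode:.2f} FLOPs/byte,",
      bound(decode, PEAK_FLOPS, MEM_BANDWIDTH))
```

With one token per step, the weights are read once per roughly two FLOPs per element, so decode sits far below the ridge. Batching more sequences per decode step raises `n_tokens` and moves the kernel toward the compute-bound side.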