Tag: kv-cache

Blog Post·2024-06-19·3 min read

The Architecture Playground: What a Transformer Config Actually Buys You

An interactive research blog. Drag the config of a decoder-only transformer (hidden size, head counts, FFN type) and watch the parameter count, KV cache size, and mixture-of-experts routing recompute live.

architecture transformers interactive kv-cache moe

Blog Post·2024-06-19·4 min read

KV Cache Memory: Quantization, Eviction, and the Long-Context Problem

The KV cache is often the main memory bottleneck in LLM inference. Its size grows linearly with context length and batch size, so at long contexts it can come to dominate GPU memory. Here's how quantization, eviction policies, and architectural changes keep it in check.

inference kv-cache quantization memory long-context

Blog Post·2024-06-19·5 min read

Prefill and Decode: The Two Phases of LLM Inference

LLM inference has two fundamentally different compute phases. Prefill processes the whole prompt in parallel and is typically compute-bound. Decode generates tokens one at a time and is typically memory-bandwidth-bound. Understanding the difference tells you where to optimize.

inference llm systems kv-cache