Tag: inference

Blog Post·2024-06-19·4 min read

Continuous Batching: How vLLM Serves Thousands of Requests

Static batching wastes GPU capacity whenever sequences in a batch finish at different times, because finished slots sit idle until the longest sequence completes. Continuous batching fixes this by treating the decode loop as a queue, admitting new requests the moment a slot opens up.

inference vllm batching systems

Blog Post·2024-06-19·4 min read

KV Cache Memory: Quantization, Eviction, and the Long-Context Problem

The KV cache is the main memory bottleneck in LLM inference. It grows linearly with context length and batch size, so at long contexts it can dominate GPU memory. Here's how quantization, eviction policies, and architectural changes keep it manageable.

inference kv-cache quantization memory long-context

Blog Post·2024-06-19·5 min read

Prefill and Decode: The Two Phases of LLM Inference

LLM inference has two fundamentally different compute phases. Prefill processes the prompt in parallel and is compute-bound. Decode generates tokens one at a time and is memory-bandwidth-bound. Understanding both determines how you optimize.

inference llm systems kv-cache

Blog Post·2024-06-19·5 min read

LLM Serving at Scale: Throughput, Latency, and System Design

A single GPU is the easy part. Serving LLMs at production scale involves tensor parallelism, pipeline parallelism, load balancing, SLO enforcement, and hardware heterogeneity. Here's how it fits together.

inference serving systems throughput latency

Blog Post·2024-06-19·5 min read

The Lifecycle of a KV Cache: From Prefill to Last Token

A request-level walkthrough of how the KV cache is populated, grown, and read during LLM inference — covering prefill, decode, memory layout, and why decode is memory-bandwidth-bound.

inference systems llm

Blog Post·2024-06-19·6 min read

Attention Variants: MHA, MQA, GQA, and the Memory Math Behind Them

Multi-head attention was the original. Multi-query attention was the efficient approximation. Grouped-query attention is the synthesis that modern LLMs converged on — and the reason is bandwidth, not FLOPs.

transformers attention gqa architecture inference