Tag: systems

Blog Post·2024-06-19·4 min read

Continuous Batching: How vLLM Serves Thousands of Requests

Static batching wastes GPU capacity whenever sequences finish at different times. Continuous batching fixes this by treating the decode loop as a queue — adding new requests the moment a slot opens up.

inference vllm batching systems

Blog Post·2024-06-19·5 min read

Prefill and Decode: The Two Phases of LLM Inference

LLM inference has two fundamentally different compute phases. Prefill processes the prompt in parallel and is compute-bound. Decode generates tokens one at a time and is memory-bandwidth-bound. Knowing which phase dominates your workload determines how you optimize.

inference llm systems kv-cache

Blog Post·2024-06-19·5 min read

LLM Serving at Scale: Throughput, Latency, and System Design

A single GPU is the easy part. Serving LLMs at production scale involves tensor parallelism, pipeline parallelism, load balancing, SLO enforcement, and hardware heterogeneity. Here's how it fits together.

inference serving systems throughput latency

Blog Post·2024-06-19·5 min read

The Lifecycle of a KV Cache: From Prefill to Last Token

A request-level walkthrough of how the KV cache is populated, grown, and read during LLM inference — covering prefill, decode, memory layout, and why decode is memory-bandwidth-bound.

inference systems llm

Blog Post·2024-06-19·5 min read

What Actually Happens to Padding Tokens During LLM Pretraining

Padding wastes GPU compute. Sequence packing eliminates the padding, but it introduces cross-document attention contamination unless you explicitly mask it. Here's what the attention mask actually looks like.

training data systems

Blog Post·2024-06-19·7 min read

RL for Agentic Systems

Single-turn RL teaches a model to produce good responses. Agentic RL teaches it to complete multi-step tasks in an environment — with delayed rewards, partial observability, and real consequences.

rl agents systems llm-training