Tag: memory-bound

Blog Post · 2025-01-17 · 8 min read

Memory-Bound vs Compute-Bound: Where LLM Inference Really Spends Its Time

Every LLM operation is limited either by how fast you can move bytes or by how fast you can multiply. The roofline model tells you which, and understanding it explains why decode is slow, why batching helps, why prefill is fast, and why Flash Attention exists.

inference gpu roofline memory-bound compute-bound arithmetic-intensity performance