Blog Post · 8 min read
Memory-Bound vs Compute-Bound: Where LLM Inference Really Spends Its Time
Every LLM operation is limited either by how fast you can move bytes or by how fast you can multiply. The roofline model tells you which, and understanding it explains why decode is slow, why batching helps, why prefill is fast, and why Flash Attention exists.
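To make the roofline idea concrete, here is a minimal sketch in Python. The hardware figures (100 TFLOP/s peak compute, 1 TB/s memory bandwidth) are hypothetical round numbers, not the specs of any real accelerator. The helper names `attainable_flops`, `classify`, and `matmul_intensity` are invented for illustration. The matmul model assumes each operand is read once and the output is written once, which ignores caching and reuse.

```python
# Hypothetical accelerator (assumed round numbers, not a real device).
PEAK_FLOPS = 100e12      # peak compute, FLOP/s
PEAK_BANDWIDTH = 1e12    # peak memory bandwidth, bytes/s


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH):
    """Roofline: throughput is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity, peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH):
    """Compare arithmetic intensity (FLOP/byte) against the ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte where the two limits meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


def matmul_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """Arithmetic intensity of a (batch x d_in) @ (d_in x d_out) matmul.

    Assumes inputs, weights, and outputs each cross the memory bus once.
    """
    flops = 2 * batch * d_in * d_out  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved


# One token at a time (decode-like) versus many tokens at once (prefill-like).
decode_like = matmul_intensity(batch=1, d_in=4096, d_out=4096)
prefill_like = matmul_intensity(batch=512, d_in=4096, d_out=4096)

print(f"batch=1:   {decode_like:.2f} FLOP/byte, {classify(decode_like)}")
print(f"batch=512: {prefill_like:.2f} FLOP/byte, {classify(prefill_like)}")
```

Under these assumptions, the single-row matmul performs roughly one FLOP per byte moved, far below the ridge point of 100 FLOP/byte, while the 512-row version reuses each weight enough to cross it. That is the same weight matrix in both cases; only the amount of work done per byte loaded has changed.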