Tag: flash-attention

Blog Post · 2024-06-19 · 8 min read

GPU Architecture for LLM Inference

LLM inference is shaped by GPU hardware: HBM bandwidth, SRAM per SM, tensor core throughput, and the roofline that connects them. This post maps the memory hierarchy from HBM to tensor core, shows where decode and prefill sit on the roofline, and explains why FlashAttention exists.
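As a rough illustration of the roofline idea mentioned above, the sketch below estimates the arithmetic intensity (FLOPs per byte moved) of a single weight matrix multiply for a one-token decode step and for a 2048-token prefill, then compares both to the ridge point of a GPU. The hardware figures, the layer size, the `Gpu` class, and the `matmul_intensity` helper are all assumptions chosen for illustration, not measurements of a specific device or code from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical accelerator; the numbers passed in are illustrative."""
    peak_flops: float     # peak tensor core throughput, FLOP/s
    hbm_bandwidth: float  # HBM bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Intensity (FLOP/byte) at which the memory-bound and
        # compute-bound ceilings of the roofline meet.
        return self.peak_flops / self.hbm_bandwidth

    def attainable_flops(self, intensity: float) -> float:
        # Roofline model: throughput is capped by compute or by memory traffic.
        return min(self.peak_flops, intensity * self.hbm_bandwidth)


def matmul_intensity(tokens: int, d_in: int, d_out: int,
                     bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matmul.

    Assumes weights and activations are each read or written exactly once
    from HBM, in 16-bit precision by default.
    """
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * (d_in + d_out))
    return flops / bytes_moved


# Assumed round numbers: 1000 TFLOP/s of compute, 3 TB/s of HBM bandwidth.
gpu = Gpu(peak_flops=1.0e15, hbm_bandwidth=3.0e12)

decode = matmul_intensity(tokens=1, d_in=4096, d_out=4096)
prefill = matmul_intensity(tokens=2048, d_in=4096, d_out=4096)

for name, intensity in [("decode", decode), ("prefill", prefill)]:
    bound = "memory-bound" if intensity < gpu.ridge_point else "compute-bound"
    print(f"{name}: {intensity:.1f} FLOP/byte, {bound}, "
          f"attainable {gpu.attainable_flops(intensity) / 1e12:.1f} TFLOP/s")
```

Under these assumptions, decode lands at about 1 FLOP/byte, far left of the ridge point, so it is limited by HBM bandwidth, while prefill reuses each weight across many tokens and lands on the compute-bound side.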

inference gpu hardware systems flash-attention