GPU Architecture for LLM Inference
LLM inference is shaped by GPU hardware: HBM bandwidth, SRAM per SM, tensor core throughput, and the roofline that connects them. This post maps the memory hierarchy from HBM to tensor core, shows where decode and prefill sit on the roofline, and explains why FlashAttention exists.
Every performance number in the previous parts of this series — the 7 ms decode floor, the 156 FLOP/byte ridge point, the reason FlashAttention matters — traces back to the same physical object: a GPU whose memory is organised in levels of wildly different size, bandwidth, and latency. Understanding that hierarchy is not optional background; it is the reason the serving stack is built the way it is.
The H100 at a glance
The H100 SXM5 has:
- 80 GB HBM3 at 3.35 TB/s — the main memory where weights and KV caches live
- 50 MB L2 cache on-chip, roughly 12 TB/s effective
- 132 Streaming Multiprocessors (SMs), each with 228 KB of programmable SRAM
- 4 tensor cores per SM executing 16×16×16 matrix-multiply-accumulate in ~8 cycles
- Peak compute: 989 TFLOP/s fp16 (without sparsity)
The ridge point, the arithmetic intensity where the roofline transitions from bandwidth-bound to compute-bound, is $989\ \text{TFLOP/s} \,/\, 3.35\ \text{TB/s} \approx 295$ FLOP/byte. Any workload with intensity below 295 leaves compute idle and is bottlenecked on memory bandwidth. Any workload above it leaves bandwidth idle and is bottlenecked on the tensor cores.
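The ridge point and the two roofs can be sketched in a few lines. This is a minimal roofline model using only the peak figures quoted in the list above; the function names are illustrative, and real kernels land below both roofs.

```python
# H100 SXM5 figures quoted in the list above.
PEAK_FLOPS = 989e12       # dense fp16 tensor-core peak, FLOP/s
HBM_BANDWIDTH = 3.35e12   # HBM3 bandwidth, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = HBM_BANDWIDTH) -> float:
    """Roofline model: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)


print(f"ridge point: {ridge_point(PEAK_FLOPS, HBM_BANDWIDTH):.0f} FLOP/byte")
for intensity in (1, 32, 295, 1024):
    tflops = attainable_flops(intensity) / 1e12
    print(f"intensity {intensity:>4}: {tflops:7.1f} TFLOP/s attainable")
```

Below the ridge the attainable throughput grows linearly with intensity; above it, the curve is flat at the tensor-core peak.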
The memory hierarchy: five levels
Figure: GPU memory hierarchy (H100), showing what each level stores, its size, its bandwidth, and its role in LLM inference.
Working from the largest and slowest level to the smallest and fastest:
HBM is where everything starts. Model weights in fp16 cost 2 bytes per parameter, so a 70B model occupies 140 GB, which already requires two H100s. Every decode step reads these weights once to compute the single-token forward pass. With 3.35 TB/s and 140 GB, the lower bound on a decode step is $140\ \text{GB} / 3.35\ \text{TB/s} \approx 42$ ms, which at batch size 1 is also the cost per token. Batching amortises this: with a batch of 32, the same weight read produces 32 tokens, dropping the per-token cost to $42 / 32 \approx 1.3$ ms.
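The weight-read floor is simple enough to compute directly. The sketch below follows the paragraph's own simplification of streaming all 140 GB at a single 3.35 TB/s; it ignores KV cache reads and the extra aggregate bandwidth a second GPU would bring, and `decode_floor_ms` is an illustrative name.

```python
def decode_floor_ms(n_params: float, bytes_per_param: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Time to stream every weight once from HBM, in milliseconds."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1e3


# 70B parameters, fp16 (2 bytes each), 3.35 TB/s of HBM bandwidth.
step_ms = decode_floor_ms(70e9, 2, 3.35e12)
for batch in (1, 8, 32):
    print(f"batch {batch:>2}: step >= {step_ms:.1f} ms, "
          f"per token >= {step_ms / batch:.2f} ms")
```

The step time does not change with batch size in this model; only the per-token share of it does.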
L2 cache (50 MB on H100) sits between HBM and the SMs. For small models, projections that are reused across many SMs can live here rather than being re-fetched from HBM, effectively multiplying the apparent bandwidth. But 50 MB holds only about 25 million fp16 parameters, so for any LLM-scale model the weights stream from HBM on every step.
SRAM (228 KB per SM) is the critical resource for attention. Unlike HBM, SRAM is on-chip and accessible by all threads within an SM at ~19 TB/s with ~30-cycle latency. The tension is its size: 228 KB is not much when an attention head has keys and values for thousands of tokens.
Registers (65,536 × 32-bit per SM, shared across threads) hold the live operands for the tensor core. When a kernel demands too many registers per thread, the SM fits fewer concurrent warps. Register pressure directly governs occupancy, which governs how much warp-level parallelism is available to hide memory latency.
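A rough sketch of that trade-off, assuming registers are the only limit on residency. The 64-warp cap is an assumption about H100-class SMs rather than a figure from this post, and register allocation granularity, shared memory use, and block-size limits are all ignored.

```python
REGISTERS_PER_SM = 65_536   # 32-bit registers per SM (from the text)
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 64       # assumption: resident-warp limit on H100-class SMs


def resident_warps(registers_per_thread: int) -> int:
    """Warps that fit on one SM when registers are the only limit."""
    by_registers = REGISTERS_PER_SM // (registers_per_thread * THREADS_PER_WARP)
    return min(MAX_WARPS_PER_SM, by_registers)


for regs in (32, 64, 128, 255):
    warps = resident_warps(regs)
    print(f"{regs:>3} regs/thread -> {warps:>2} warps "
          f"({warps / MAX_WARPS_PER_SM:.0%} of the assumed maximum)")
```

Doubling the registers per thread halves the number of warps the scheduler can switch between while loads are in flight.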
Tensor cores are the throughput engine. Each SM has 4; the whole chip has 528. They execute a 16×16×16 MMA — two 16×16 input tiles (A and B, in fp16 or bf16) against a 16×16 fp32 accumulator — in a few cycles. The peak throughput only materialises if operands arrive in registers on time. A tensor core blocked on a register load is a tensor core producing zero FLOP/s.
Where decode and prefill sit on the roofline
Figure: Roofline, decode vs prefill. Arithmetic intensity (FLOP/byte) places each workload on the roofline. Decode is memory-bound; prefill is compute-bound. Batch size shifts the decode point rightward.
Decode has arithmetic intensity equal to the batch size. For a model with $P$ parameters in fp16 and a batch of $B$ sequences, each step performs $2PB$ FLOPs and reads $2P$ bytes of weights:
$$I_{\text{decode}} = \frac{2PB}{2P} = B \ \text{FLOP/byte}$$
At batch 1, intensity is 1 FLOP/byte, more than two orders of magnitude below the H100's ridge point of 295. The tensor cores sit idle 99.7% of the time; the step is gated entirely on how fast HBM can ship weights to the SMs. Increasing batch size shifts the decode point rightward on the roofline. You cross into compute-bound territory only at a batch size of about 295, well beyond what the KV cache budget allows for large models.
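Prefill has intensity proportional to the sequence length $N$, because one read of the $2P$ bytes of fp16 weights (for $P$ parameters) now serves $N$ tokens:
$$I_{\text{prefill}} = \frac{2PN}{2P} = N \ \text{FLOP/byte}$$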
A 1024-token prefill has intensity 1024 FLOP/byte — past the ridge on both A100 and H100. Prefill is compute-bound, which is why adding GPUs or tensor parallelism helps it and why quantisation (which reduces the byte count denominator) helps decode more than prefill.
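The two intensities can be checked with a small calculation. This sketch counts weight traffic only, at 2 FLOP per parameter per token, and assumes the H100 ridge point derived from the peak figures earlier in the post; `intensity` and `bound` are illustrative names.

```python
RIDGE = 989e12 / 3.35e12   # H100 ridge point, about 295 FLOP/byte


def intensity(n_params: float, tokens_per_weight_read: int,
              bytes_per_param: float = 2.0) -> float:
    """FLOP per byte of weights read, at 2 FLOP per parameter per token."""
    flops = 2.0 * n_params * tokens_per_weight_read
    weight_bytes = bytes_per_param * n_params
    return flops / weight_bytes


def bound(i: float) -> str:
    """Which roof limits a workload of arithmetic intensity i."""
    return "compute-bound" if i >= RIDGE else "memory-bound"


cases = {"decode, batch 1": 1, "decode, batch 32": 32, "prefill, 1024 tokens": 1024}
for name, tokens in cases.items():
    i = intensity(70e9, tokens)
    print(f"{name:<22} {i:6.0f} FLOP/byte  {bound(i):<13} "
          f"peak utilisation <= {min(1.0, i / RIDGE):.1%}")

# One-byte weights halve the denominator, doubling decode intensity.
print("decode, batch 32, 1-byte weights:", intensity(70e9, 32, bytes_per_param=1.0))
```

The last line illustrates the quantisation point: shrinking the bytes per parameter moves a memory-bound decode step toward the ridge, while a prefill that is already past the ridge gains nothing from it on this model.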
Why FlashAttention exists
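The naive implementation of attention computes $O = \mathrm{softmax}(QK^\top / \sqrt{d})\,V$ by materialising the full $N \times N$ score matrix in HBM:
- Write $S = QK^\top$ to HBM: $\Theta(N^2)$ bytes written
- Read $S$ back from HBM to apply softmax row-wise: $\Theta(N^2)$ bytes read
- Write the softmax output $A = \mathrm{softmax}(S)$ to HBM: $\Theta(N^2)$ bytes written
- Read $A$ and $V$ from HBM to compute the output $O = AV$: $\Theta(N^2 + Nd)$ bytes read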
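At long context lengths the score matrix alone outgrows the on-chip caches: at $N = 8192$ tokens with fp16 scores, for instance, it is $2N^2 \approx 134$ MB per head, larger than the entire L2 cache. It is written to HBM twice (raw scores, then softmax output) and read back twice. This is the memory access pattern that makes attention the bottleneck for long contexts.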
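To put numbers on the four steps listed above, the sketch below tallies HBM bytes per attention head. The sequence length and head dimension are illustrative assumptions, and the function name is hypothetical.

```python
def naive_attention_traffic(n: int, d: int, bytes_per_el: int = 2):
    """Per-head HBM bytes for naive attention: (one score matrix, N^2 traffic, N*d traffic)."""
    score_bytes = n * n * bytes_per_el   # one N x N matrix
    io_bytes = n * d * bytes_per_el      # one N x d matrix
    quadratic = 4 * score_bytes          # write scores, read scores, write softmax, read softmax
    linear = 4 * io_bytes                # read Q, K, V; write the output
    return score_bytes, quadratic, linear


# Assumed example sizes: 8192 tokens, head dimension 128, fp16.
score, quad, lin = naive_attention_traffic(n=8192, d=128)
print(f"score matrix:      {score / 1e6:8.1f} MB")
print(f"N^2 traffic:       {quad / 1e6:8.1f} MB")
print(f"Q, K, V, O traffic:{lin / 1e6:8.1f} MB")
print(f"ratio:             {quad / lin:8.0f}x")
```

At these sizes the quadratic terms outweigh the unavoidable reads of the inputs and the write of the output by a factor of $N/d$.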
Figure: FlashAttention SRAM tiling. FlashAttention loops over Q tiles (outer) and K/V tiles (inner). At each step only the current tiles are in SRAM; the N² score matrix is never materialised in HBM.
FlashAttention (Dao et al., 2022) reorders the computation to keep tiles in SRAM throughout:
- Divide $Q$ into outer tiles $Q_i$ of $B_r$ rows; divide $K$ and $V$ into inner tiles $K_j$, $V_j$ of $B_c$ rows each, so that each $K_j$ covers $B_c$ columns of the score matrix.
- For each outer tile $Q_i$, load it into SRAM once.
- For each inner tile pair $K_j$, $V_j$: load it into SRAM, compute $S_{ij} = Q_i K_j^\top$, update the running softmax statistics (online softmax), and accumulate into the output tile $O_i$.
- Write $O_i$ to HBM once when the inner loop finishes.
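The loop structure above can be sketched in NumPy. This is a numerical illustration of the tiling and the online softmax, not the CUDA kernel: array slices stand in for SRAM loads, there is no masking or dropout, and the tile sizes are arbitrary assumptions.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference: materialises the full N x N score matrix."""
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def tiled_attention(q, k, v, block_rows=64, block_cols=64):
    """Q tiles in the outer loop, K/V tiles in the inner loop, online softmax."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty_like(q)
    for i in range(0, n, block_rows):
        q_i = q[i:i + block_rows]               # "load" the Q tile once
        m = np.full(q_i.shape[0], -np.inf)      # running row maximum
        l = np.zeros(q_i.shape[0])              # running softmax denominator
        acc = np.zeros_like(q_i)                # unnormalised output tile
        for j in range(0, k.shape[0], block_cols):
            k_j, v_j = k[j:j + block_cols], v[j:j + block_cols]
            s_ij = (q_i @ k_j.T) * scale        # only one tile of scores exists
            m_new = np.maximum(m, s_ij.max(axis=1))
            correction = np.exp(m - m_new)      # rescale earlier partial sums
            p_ij = np.exp(s_ij - m_new[:, None])
            l = l * correction + p_ij.sum(axis=1)
            acc = acc * correction[:, None] + p_ij @ v_j
            m = m_new
        out[i:i + block_rows] = acc / l[:, None]  # "write" the output tile once
    return out


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 200, 32))     # N = 200 is not a multiple of 64
print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

The `correction` factor is what makes the softmax "online": each time a new tile raises the running maximum, the partial sums accumulated so far are rescaled to the new reference before the tile's contribution is added.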
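The score matrix is never written to HBM, so the quadratic intermediate traffic disappears and the extra HBM footprint is linear in $N$ rather than quadratic. $K$ and $V$ tiles are still re-read once per $Q$ tile, so in the analysis of Dao et al. total traffic falls from $\Theta(Nd + N^2)$ to $\Theta(N^2 d^2 / M)$ for SRAM size $M$, a large reduction whenever $d^2 \ll M$. The tile sizes are chosen to fit $Q_i$, $K_j$, $V_j$, and $O_i$ simultaneously in SRAM. Taking $B_r = B_c$ and fp16 operands (2 bytes per element):
$$4 \cdot B_r \cdot d \cdot 2\ \text{bytes} \le M \quad\Longrightarrow\quad B_r \le \frac{M}{8d}$$
where $M$ is the SRAM capacity in bytes. With 228 KB of SRAM, tiles of roughly 450 rows fit at $d = 64$ and roughly 230 at $d = 128$, covering typical head dimensions. The result is that attention, previously one of the most bandwidth-hungry operations in the forward pass, becomes compute-bound for long sequences.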
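The tile-size bound is a one-line calculation. The sketch below assumes four equally sized fp16 tiles resident at once and treats the whole 228 KB as usable; production kernels typically pick smaller tiles to leave room for pipelining, so treat the result as an upper bound. The function name is illustrative.

```python
def max_tile_rows(sram_bytes: int, head_dim: int,
                  bytes_per_el: int = 2, tiles_resident: int = 4) -> int:
    """Largest row count such that `tiles_resident` tiles of rows x head_dim fit in SRAM."""
    return sram_bytes // (tiles_resident * head_dim * bytes_per_el)


SRAM_BYTES = 228 * 1024   # 228 KB per SM, from the text
for d in (64, 128, 256):
    print(f"head_dim {d:>3}: up to {max_tile_rows(SRAM_BYTES, d)} rows per tile")
```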
SM compute pipeline: loading a tile
Figure: SM compute pipeline, one MMA step, showing how one matrix-multiply tile flows through the SM.
A single matrix-multiply tile flows through the SM in six stages. The time budget is dominated by the HBM load: a 16×16 fp16 tile is 512 bytes, and with the full 3.35 TB/s to itself an SM could fetch one in $512\ \text{B} / 3.35\ \text{TB/s} \approx 0.15$ ns. But that bandwidth is shared by 132 SMs, each with 32 threads per warp and potentially dozens of warps in flight, all issuing loads; the effective per-warp bandwidth is a fraction of the peak, and the scheduling of those loads (latency hiding through warp switching) is what keeps the tensor cores fed.
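The pipeline follows the hierarchy above: HBM to L2, L2 to SRAM, SRAM to registers, registers to the tensor cores for the MMA, and the accumulated result back out to memory.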
For a bandwidth-bound kernel like single-sequence decode, the HBM load dominates. For a compute-bound kernel like large-batch prefill, the tensor core is the constraint and the memory pipeline must overlap loads for the next tile with the MMA for the current one — the software pipelining that CUTLASS and Triton handle automatically.
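A toy timing model shows why overlapping loads with MMAs matters. It assumes double buffering, where the load of the next tile runs concurrently with the MMA on the current one, and it uses arbitrary time units that are hypothetical rather than measured.

```python
def serial_time(n_tiles: int, load: float, mma: float) -> float:
    """No overlap: every tile waits for its load, then computes."""
    return n_tiles * (load + mma)


def pipelined_time(n_tiles: int, load: float, mma: float) -> float:
    """Double buffering: the load of tile i+1 overlaps the MMA of tile i."""
    if n_tiles == 0:
        return 0.0
    # First load is exposed, the last MMA is exposed, and each step in
    # between costs whichever of the two stages is slower.
    return load + (n_tiles - 1) * max(load, mma) + mma


for label, load, mma in (("bandwidth-bound", 4.0, 1.0), ("compute-bound", 1.0, 4.0)):
    s = serial_time(100, load, mma)
    p = pipelined_time(100, load, mma)
    print(f"{label:<16} serial {s:.0f}, pipelined {p:.0f} (x{s / p:.2f})")
```

In both regimes the pipelined time approaches the cost of the slower stage alone, which is the best a kernel can do once one resource is saturated.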
What this means for the serving stack
The memory hierarchy imposes three constraints that the entire serving stack is built around:
Decode is HBM-bandwidth-bound for any realistic batch. The weight read is the floor. Every technique that shrinks the bytes read per generated token is an attack on this floor: fp8 quantisation halves the weight volume relative to fp16, speculative decoding yields multiple tokens from one weight read, and GQA shrinks the KV cache that is read alongside the weights. Every technique that keeps the batch large (continuous batching, careful memory management) moves the decode point rightward on the roofline.
Attention is SRAM-constrained for long contexts. The N² HBM traffic of naive attention is what FlashAttention eliminates by tiling into SRAM. Grouped-query attention (GQA) and multi-query attention (MQA) reduce the K and V head count, shrinking both the SRAM footprint per tile and the HBM KV cache.
Tensor core utilisation requires register discipline. The peak 989 TFLOP/s is only available if operands arrive in registers without stalling. Occupancy, the number of warps an SM can hold simultaneously, is the lever: more warps mean more warp-level parallelism to hide latency, but more registers per thread mean fewer warps. Kernel writers (and the compilers inside Triton/CUTLASS) spend more effort on this pipeline than on the arithmetic itself.
The next post works through the other structural gap in the series: how vLLM's PagedAttention manages the KV cache as a paged virtual address space, eliminating the fragmentation that would otherwise prevent the large batches the roofline demands.