CUDA exposes GPU parallelism through a three-level thread hierarchy: grid, block, and warp. Understanding how these map to hardware — SMs, register files, shared memory — is the prerequisite for writing fast kernels.
Kernel fusion eliminates the HBM round-trips between chained operations. Triton makes this practical in Python. This post builds a fused online softmax from scratch, then extends it to a fused RMSNorm + linear projection — the kind of kernel that actually speeds up LLM inference.
LLM inference is shaped by GPU hardware: HBM bandwidth, SRAM per SM, tensor core throughput, and the roofline that connects them. This post maps the memory hierarchy from HBM to tensor core, shows where decode and prefill sit on the roofline, and explains why FlashAttention exists.