Tag: gpu

Blog Post·2024-06-20·9 min read

CUDA Fundamentals for ML Engineers

CUDA exposes GPU parallelism through a three-level thread hierarchy: grid, block, and thread, with the hardware scheduling threads in warps of 32. Understanding how these map to hardware (SMs, register files, shared memory) is the prerequisite for writing fast kernels.

gpu cuda kernels inference optimization

Blog Post·2024-06-20·8 min read

Writing Fused Kernels with Triton

Kernel fusion eliminates the HBM round-trips between chained operations. Triton makes this practical in Python. This post builds a fused online softmax from scratch, then extends it to a fused RMSNorm + linear projection — the kind of kernel that actually speeds up LLM inference.

gpu triton kernels fusion inference optimization

Blog Post·2024-06-19·8 min read

GPU Architecture for LLM Inference

LLM inference is shaped by GPU hardware: HBM bandwidth, SRAM per SM, tensor core throughput, and the roofline that connects them. This post maps the memory hierarchy from HBM to tensor core, shows where decode and prefill sit on the roofline, and explains why FlashAttention exists.

inference gpu hardware systems flash-attention