Tag: kernels

Blog Post·2024-06-20·9 min read

CUDA Fundamentals for ML Engineers

CUDA exposes GPU parallelism through a three-level thread hierarchy: grid, block, and thread. The hardware executes the threads of a block in groups of 32 called warps. Understanding how these map to hardware (SMs, register files, shared memory) is the prerequisite for writing fast kernels.
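To make the hierarchy concrete, here is a minimal sketch in plain Python that models how a 1D launch assigns indices. It is a CPU-side illustration, not CUDA code: `launch_1d` and `blocks_needed` are hypothetical helper names, and the sizes (1000 elements, 256 threads per block) are chosen only for the example. The index formula mirrors the usual `blockIdx.x * blockDim.x + threadIdx.x` pattern.

```python
# Pure-Python model of CUDA's 1D launch indexing (runs on the CPU, no GPU needed).
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def blocks_needed(n, block_dim):
    """Ceiling division: how many blocks are needed to cover n elements."""
    return (n + block_dim - 1) // block_dim


def launch_1d(grid_dim, block_dim):
    """Yield (block_idx, thread_idx, warp_in_block, global_idx) for every thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Consecutive threads of a block are grouped into warps of 32.
            warp_in_block = thread_idx // WARP_SIZE
            # Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in a kernel.
            global_idx = block_idx * block_dim + thread_idx
            yield block_idx, thread_idx, warp_in_block, global_idx


n = 1000          # illustrative problem size (assumption)
block_dim = 256   # illustrative threads per block (assumption)
grid_dim = blocks_needed(n, block_dim)

threads = list(launch_1d(grid_dim, block_dim))
# The grid overshoots n, so a real kernel guards its work with `if (idx < n)`.
active = [t for t in threads if t[3] < n]
```

The grid is rounded up to whole blocks, so some threads in the last block have no element to process. That is why the bounds check appears in almost every elementwise kernel.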

gpu cuda kernels inference optimization

Blog Post·2024-06-20·8 min read

Writing Fused Kernels with Triton

Kernel fusion eliminates the HBM round-trips between chained operations. Triton makes this practical in Python. This post builds a fused online softmax from scratch, then extends it to a fused RMSNorm + linear projection — the kind of kernel that actually speeds up LLM inference.
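As a rough illustration of the online softmax idea, the sketch below tracks a running maximum and a running normalizer in a single pass over the input, rescaling the normalizer whenever the maximum changes. It is plain Python rather than Triton, it is not the kernel from the post, and the function names are hypothetical. It assumes finite inputs.

```python
import math


def online_softmax(xs):
    """Softmax using one pass for (max, normalizer) and one pass to normalize."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated under the old max, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]


def two_pass_softmax(xs):
    """Reference version: separate passes for the max and for the normalizer."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = online_softmax([1.0, 2.0, 3.0])
reference = two_pass_softmax([1.0, 2.0, 3.0])
```

The rescaling step is what lets a fused kernel process a row block by block without first reading the whole row to find its maximum, which is one fewer trip through memory.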

gpu triton kernels fusion inference optimization