S. Roy

Tag: triton

Blog Post · 8 min read

Writing Fused Kernels with Triton

Kernel fusion keeps intermediate results on-chip, in registers or shared memory, which eliminates the HBM round-trips between chained operations. Triton makes this practical in Python. This post builds a fused online softmax from scratch, then applies the same approach to a fused RMSNorm + linear projection, the kind of kernel that actually speeds up LLM inference.