Blog Post··6 min read
Attention Variants: MHA, MQA, GQA, and the Memory Math Behind Them
Multi-head attention was the original. Multi-query attention was the efficient approximation. Grouped-query attention is the synthesis that modern LLMs converged on — and the reason is bandwidth, not FLOPs.