Tag: attention

Blog Post · 2024-06-19 · 6 min read

Attention Variants: MHA, MQA, GQA, and the Memory Math Behind Them

Multi-head attention (MHA) was the original design. Multi-query attention (MQA) was the efficient variant, sharing a single key/value head across all query heads. Grouped-query attention (GQA) is the synthesis that modern LLMs converged on, and the reason is memory bandwidth, not FLOPs.

transformers attention gqa architecture inference