Blog Post · 9 min read
What Each Transformer Component Actually Does
Attention heads as information-routing circuits, MLP layers as key-value memories, and the residual stream as a shared communication bus.