Blog Post · 11 min read
Representation Geometry: How Neural Networks Encode Meaning
The linear representation hypothesis, superposition, polysemanticity, and why transformer activations are more structured than they look.
Information theory gives precise answers to questions like: how much does the context tell you about the next token? What information is preserved in a representation? Why do compression and prediction point to the same objective?
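One way to see why compression and prediction line up: an ideal entropy coder spends about -log2 p bits on a symbol the model assigned probability p, so the total code length is the summed negative log-likelihood that a predictive model is trained to minimize. The sketch below is illustrative only; the function names and the probabilities are made up for the example, not taken from any particular model.

```python
import math

def code_length_bits(probs_of_observed):
    """Total ideal code length, in bits, for a sequence of symbols,
    given the probability the model assigned to each observed symbol."""
    return sum(-math.log2(p) for p in probs_of_observed)

def cross_entropy_bits(probs_of_observed):
    """Average bits per symbol: the empirical cross-entropy of the model."""
    return code_length_bits(probs_of_observed) / len(probs_of_observed)

# Hypothetical probabilities a model assigned to the four tokens that actually occurred.
probs = [0.5, 0.25, 0.25, 0.5]
total_bits = code_length_bits(probs)
avg_bits = cross_entropy_bits(probs)
print(total_bits, avg_bits)
```

A model that assigns higher probability to what actually happens produces a shorter code, which is why better prediction and better compression are the same improvement measured in the same units.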
MAE predicts pixels. Contrastive methods match views. JEPA predicts representations of target regions from context regions — in an abstract space where irrelevant details have already been discarded.
Supervised learning requires labels. Labels require humans. At scale, that's the bottleneck. Self-supervised learning sidesteps it by constructing supervision from the data itself.
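As a minimal sketch of "constructing supervision from the data itself", next-token prediction turns any unlabeled sequence into (context, target) pairs with no human annotation. The helper name `next_token_pairs` and the whitespace tokenization are assumptions made for illustration; real systems use learned tokenizers and much longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) training pairs from an unlabeled token sequence.

    Each target is simply the token that follows its context window,
    so the labels come from the data rather than from annotators.
    """
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        target = tokens[i]
        pairs.append((context, target))
    return pairs

# Toy example with whitespace tokenization (an assumption for brevity).
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

The same pattern covers other pretext tasks: masking a region and predicting it, or pairing two augmented views of one example, only changes how the input and target are carved out of the raw data.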