Blog Post··11 min read
Sparse Autoencoders: Decomposing Neural Networks into Interpretable Features
Dictionary learning for neural networks — how sparse autoencoders recover monosemantic features from polysemantic activations, and what Anthropic's scaling monosemanticity work found in Claude.