Tag: dictionary-learning

Blog Post·2025-06-20·11 min read

Sparse Autoencoders: Decomposing Neural Networks into Interpretable Features

Dictionary learning for neural networks — how sparse autoencoders recover monosemantic features from polysemantic activations, and what Anthropic's scaling monosemanticity work found in Claude.

mechanistic-interpretability sparse-autoencoders features superposition dictionary-learning