Blog Post··12 min read
Open Problems in Mechanistic Interpretability
Faithfulness vs. plausibility, scaling to frontier models, the composition problem, automated interpretability, and what it would take to actually understand a large language model.