Measuring Sparse Autoencoder Feature Sensitivity

ICLR 2026 Conference Submission 15941 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Sparse Autoencoders, Mechanistic Interpretability, Automated Interpretability, LLMs, Evaluations
TL;DR: We measure Sparse Autoencoder Feature Sensitivity
Abstract: Sparse Autoencoder (SAE) features have become essential tools for mechanistic interpretability research. SAE features are typically characterized by examining their activating examples, which are often "monosemantic" and align with human-interpretable concepts. However, these examples do not reveal *feature sensitivity*: how reliably a feature activates on texts similar to its activating examples. In this work, we develop a scalable method to evaluate feature sensitivity. Our approach avoids the need to generate natural language descriptions for features; instead, we directly use language models to generate text similar to a feature's activating examples and test whether the feature activates on these inputs. We demonstrate that sensitivity measures a new facet of feature quality and find that many interpretable features have poor sensitivity. Human evaluation confirms that when features fail to activate on our generated text, that text genuinely resembles the original activating examples. Lastly, we study feature sensitivity at the SAE level and observe that average feature sensitivity declines with increasing SAE width across seven SAE variants. Our work establishes feature sensitivity as a new dimension for evaluating both individual features and SAE architectures.
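Below is a minimal sketch of the evaluation loop the abstract describes, assuming hypothetical helpers `generate_similar_text` (a language model prompted with a feature's activating examples) and `feature_activation` (the feature's maximum activation on a text). The paper's actual prompting, sampling, and thresholding details are not specified here; this only illustrates the high-level procedure.

```python
# Hypothetical sketch: estimate a feature's sensitivity as the fraction of
# LLM-generated texts (similar to its activating examples) on which the
# feature actually activates. Helper names are illustrative, not the
# authors' implementation.

def feature_sensitivity(activating_examples, generate_similar_text,
                        feature_activation, n_samples=10, threshold=0.0):
    hits = 0
    for _ in range(n_samples):
        # Generate new text resembling the activating examples directly,
        # without first writing a natural-language feature description.
        text = generate_similar_text(activating_examples)
        # Check whether the SAE feature fires anywhere in the generated text.
        if feature_activation(text) > threshold:
            hits += 1
    return hits / n_samples
```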
Primary Area: interpretability and explainable AI
Submission Number: 15941