Position: Use Sparse Autoencoders to Discover Unknowns

Kenny Peng; Rajiv Movva; Jon Kleinberg; Emma Pierson; Nikhil Garg

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 · Mech Interp Workshop ICML 2026 (via fast track) · Poster · CC BY 4.0

Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Interpretability for Knowledge Discovery, Applications of interpretability

TL;DR: We argue that positive SAE results fall in the "discover unknowns" regime, while negative results fall in the "act on knowns" regime. We describe applications of SAEs in the first regime.

Abstract: While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results has added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that even if SAEs may be less effective for *acting on known concepts*, they are especially powerful tools for *discovering unknown concepts*. This distinction separates existing negative results from positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.

Submission Number: 350
