Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

23 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 Position Paper Track · CC BY-NC 4.0
Keywords: sparse autoencoders, interpretability, explainability, computational social science, text as data
Abstract: While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results has added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs are less effective tools for *acting on known concepts*, they are powerful tools for *discovering unknown concepts*. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) text as data, (ii) bridging prediction and explanation in ML-based science, and (iii) ML interpretability, explainability, fairness, and auditing.
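
The submission does not include an implementation, but for readers unfamiliar with the architecture the abstract assumes, here is a minimal sketch of a standard SAE: an overcomplete ReLU dictionary trained to reconstruct model activations under an L1 sparsity penalty. All names, shapes, and hyperparameters below are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary with ReLU feature activations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations; the L1 penalty pushes most to zero.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 term encouraging sparse feature use.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, d_hidden = 64, 512           # overcomplete: d_hidden >> d_model
    sae = SparseAutoencoder(d_model, d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    acts = torch.randn(256, d_model)      # placeholder for real model activations
    for step in range(100):
        x_hat, f = sae(acts)
        loss = sae_loss(acts, x_hat, f)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # "Discovery" use in the paper's sense: inspect which inputs most strongly
    # activate each learned feature, then interpret those features post hoc.
    x_hat, f = sae(acts)
    top_inputs = f.topk(k=3, dim=0).indices  # per-feature top-activating rows
    print(loss.item(), top_inputs.shape)
```

The final step reflects the paper's framing: rather than steering on a pre-specified concept, one trains the dictionary, then examines top-activating inputs per feature to surface concepts not known in advance.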
Submission Number: 615