Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Sparse Autoencoders (SAEs) for vision tasks are currently unstable. We introduce Archetypal SAEs (A-SAE and RA-SAE) that constrain dictionary elements within the data’s convex hull.
Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the data’s convex hull. This geometric anchoring significantly enhances the stability and plausibility of the inferred dictionaries, and a mildly relaxed variant (RA-SAE) further matches state-of-the-art reconstruction performance. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, whether dictionaries recover “true” classification directions, and (ii) identifiability, whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
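To make the convex-hull constraint concrete, here is a minimal, hypothetical sketch in PyTorch; it is not the released implementation from the linked repository. Each dictionary atom is parameterized as a convex combination of a fixed set of candidate activations, so the atoms cannot leave the data’s convex hull. The class name `ArchetypalSAE`, the top-k encoder, and the hyperparameters are illustrative assumptions; the relaxed RA-SAE variant loosens this constraint slightly, as described in the paper.

```python
# Illustrative sketch only: decoder atoms stay in the convex hull of a fixed
# set of candidate activations via row-stochastic mixing weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArchetypalSAE(nn.Module):  # hypothetical class name
    def __init__(self, candidates: torch.Tensor, n_concepts: int, top_k: int = 16):
        super().__init__()
        # `candidates`: (n_points, d) subset of training activations, kept fixed.
        self.register_buffer("candidates", candidates)
        d = candidates.shape[1]
        # Logits defining one convex combination per dictionary atom.
        self.mix_logits = nn.Parameter(torch.randn(n_concepts, candidates.shape[0]))
        # Simple linear encoder followed by a top-k sparsity constraint.
        self.encoder = nn.Linear(d, n_concepts)
        self.top_k = top_k

    def dictionary(self) -> torch.Tensor:
        # Softmax rows are non-negative and sum to 1, so each atom is a
        # convex combination of candidate activations.
        weights = F.softmax(self.mix_logits, dim=-1)      # (n_concepts, n_points)
        return weights @ self.candidates                  # (n_concepts, d)

    def forward(self, x: torch.Tensor):
        codes = F.relu(self.encoder(x))                   # (batch, n_concepts)
        # Keep only the k largest activations per sample (one common SAE choice).
        topv, topi = codes.topk(self.top_k, dim=-1)
        sparse_codes = torch.zeros_like(codes).scatter_(-1, topi, topv)
        recon = sparse_codes @ self.dictionary()          # (batch, d)
        return recon, sparse_codes
```

A standard reconstruction loss (e.g., mean squared error between `recon` and `x`) can then be minimized; the geometric anchoring comes entirely from how the dictionary is parameterized, not from the training objective.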
Lay Summary: Neural networks often make decisions using internal representations that are difficult for humans to interpret. One promising approach to explainability is to extract a set of internal “concepts” — directions in the model’s representation space that act like a dictionary the model uses to make sense of the world. These concepts can help us understand what features the model is using, and why it makes certain predictions. However, current methods for building these concept dictionaries are unstable: small changes in the data or random choices during training can lead to completely different explanations. This instability makes it hard to trust or reproduce the results. Our work introduces a new method, Archetypal Sparse Autoencoders, that builds more reliable and interpretable concept dictionaries by geometrically anchoring them to the training data. We also design new evaluation benchmarks to measure whether the learned concepts align with ground truth and remain consistent across training runs. Our approach improves the stability and quality of concept-based explanations in large vision models, helping researchers and practitioners better understand how these systems work — and why.
Link To Code: https://github.com/KempnerInstitute/overcomplete
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Explainability, Interpretability, Dictionary Learning, Computer Vision, Archetypal Analysis
Submission Number: 5036