TL;DR: We introduce Universal Sparse Autoencoders, a new framework for discovering and aligning interpretable concepts shared across multiple deep neural networks.
Abstract: We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation—concepts—across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications—such as coordinated activation maximization—that open avenues for deeper insights in multi-model AI systems.
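To make the cross-model objective in the abstract concrete, the sketch below illustrates one plausible reading of it: per-model encoders and decoders tied together by a single shared, overcomplete concept space, trained so that a code obtained from any model's activations reconstructs every model's activations. The class names, top-k sparsity, dimensions, and training loop here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a Universal SAE training step (illustrative assumptions,
# not the paper's exact implementation). Each model gets its own linear
# encoder/decoder into and out of one shared concept space; sparsity is
# enforced with a simple top-k activation, a common SAE design choice.
import torch
import torch.nn as nn


class UniversalSAE(nn.Module):
    def __init__(self, model_dims, n_concepts, k_active):
        super().__init__()
        # One encoder/decoder pair per model, all mapping through the same
        # overcomplete concept dictionary of size n_concepts.
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(d, n_concepts) for name, d in model_dims.items()}
        )
        self.decoders = nn.ModuleDict(
            {name: nn.Linear(n_concepts, d) for name, d in model_dims.items()}
        )
        self.k_active = k_active

    def encode(self, name, acts):
        z = torch.relu(self.encoders[name](acts))
        # Keep only the k largest concept activations (top-k sparsity).
        topk = torch.topk(z, self.k_active, dim=-1)
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)

    def decode(self, name, z):
        return self.decoders[name](z)


# Hypothetical usage: activations from two vision models on the same images.
dims = {"model_a": 768, "model_b": 1024}
usae = UniversalSAE(dims, n_concepts=8192, k_active=64)
acts = {"model_a": torch.randn(32, 768), "model_b": torch.randn(32, 1024)}

# Shared objective: encode from one model, decode into every model, so the
# dictionary must capture factors of variation common to all of them.
loss = 0.0
for src in dims:
    z = usae.encode(src, acts[src])
    for tgt in dims:
        loss = loss + torch.mean((usae.decode(tgt, z) - acts[tgt]) ** 2)
loss.backward()
```

Because every reconstruction path passes through the same sparse code, concepts that are useful to only one model contribute little to the shared objective, which is what pushes the dictionary toward universal features.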
Lay Summary: Modern computer vision models are increasingly diverse, trained using various datasets and architectures to accomplish specific visual tasks such as depth estimation or object recognition. These design choices shape what visual "concepts" or features each model learns—from recognizing edges and textures to understanding objects and scenes. This raises a core scientific question: do these models, despite their differences, converge on learning the same fundamental visual concepts? Answering this question is challenging because the internal representations these models learn are encoded in ways that humans cannot directly interpret. Our work introduces Universal Sparse Autoencoders (USAEs) to create a universal, interpretable concept space that reveals what multiple vision models learn in common about the visual world. Our approach enables us to identify the most important universal concepts shared across models, while also discovering features that are unique to specific models. This analysis provides insight into which architectural and training choices lead to better visual representations, and which concepts appear to be fundamental building blocks for visual understanding. This work advances our ability to understand and compare how different AI systems perceive and process visual information.
Link To Code: https://github.com/YorkUCVIL/UniversalSAE
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Interpretability, Concepts, Sparse Autoencoders
Submission Number: 1640