Interpreting CLIP with Hierarchical Sparse Autoencoders

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-SA 4.0
TL;DR: The Matryoshka sparse autoencoder outperforms ReLU and TopK SAEs, enhancing the interpretability of vision-language transformers through hierarchical concept learning.
Abstract: Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern large-scale systems yet remain challenging to interpret and control. However, current SAE methods struggle to optimize reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To address this, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling direct optimization of both metrics without compromise. MSAE establishes a state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining 80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA. We make the codebase available at https://github.com/WolodjaZ/MSAE.
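To make the hierarchical idea concrete, below is a minimal sketch of a Matryoshka-style SAE forward pass in PyTorch. It is an illustration under assumptions, not the authors' implementation: the class name `MatryoshkaSAE`, the `prefix_sizes` argument, and the plain ReLU encoder are hypothetical, and the sparsity penalty is omitted for brevity. The core mechanism shown is reconstructing the input from nested prefixes of the latent code, so the first few latents must capture coarse concepts on their own while later latents refine them.

```python
# Hedged sketch of a Matryoshka-style SAE, assuming PyTorch.
# All names here are illustrative; see the linked repo for the real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, prefix_sizes: list[int]):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.prefix_sizes = prefix_sizes  # e.g. [64, 256, 1024, d_latent]

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))  # sparse latent code
        losses = []
        for k in self.prefix_sizes:
            z_k = torch.zeros_like(z)
            z_k[:, :k] = z[:, :k]              # keep only the first k latents
            x_hat = self.decoder(z_k)          # reconstruct from the nested prefix
            losses.append(F.mse_loss(x_hat, x))
        # Averaging reconstruction losses over all prefixes optimizes every
        # granularity at once; a sparsity term on z would be added in practice.
        return z, torch.stack(losses).mean()
```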
Lay Summary: Modern multimodal AI models like CLIP exhibit remarkable performance in linking visual and textual information. Yet they remain largely opaque, operating as "black boxes" with limited transparency into their internal representations. This lack of interpretability presents a critical barrier to responsible deployment and scientific understanding. To address this, we introduce the Matryoshka Sparse Autoencoder (MSAE), a novel method aimed at understanding and controlling internal representations. MSAE is designed to recover internal features at multiple levels of abstraction simultaneously, preserving both the semantic clarity and compactness of the feature representation. Applied to CLIP, our method achieves over 99% cosine similarity to the original activations while maintaining high sparsity (~80%), thereby reconciling the typical trade-off between expressiveness and simplicity. MSAE supports fine-grained inspection of model behavior. In our evaluation, we identified over 120 distinct and interpretable concepts encoded by CLIP. These features support downstream applications including concept-based retrieval, systematic bias analysis, and targeted probing.
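As an illustration of how a trained SAE enables concept-based retrieval, the sketch below ranks images by how strongly a single latent ("concept") fires on their CLIP embeddings. It assumes the `MatryoshkaSAE` sketch above; `clip_embeds` and `concept_idx` are hypothetical inputs, not part of the authors' released API.

```python
# Hedged sketch of concept-based similarity search with a trained SAE.
# `sae` is an instance like the MatryoshkaSAE sketch above; `clip_embeds`
# is a (num_images, d_model) tensor of precomputed CLIP image embeddings.
import torch

@torch.no_grad()
def rank_by_concept(sae, clip_embeds: torch.Tensor, concept_idx: int, top_k: int = 10):
    """Return indices of the images where one SAE concept fires most strongly."""
    z = torch.relu(sae.encoder(clip_embeds))  # sparse codes, one row per image
    scores = z[:, concept_idx]                # activation of the chosen concept
    return scores.topk(top_k).indices
```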
Link To Code: https://github.com/WolodjaZ/MSAE
Primary Area: Deep Learning->Other Representation Learning
Keywords: SAE, CLIP, Interpretability, Matryoshka Representation Learning
Flagged For Ethics Review: true
Submission Number: 6572