Keywords: Out-of-distribution, Sparse Autoencoder, Vision Transformer
Abstract: Sparse Autoencoders (SAEs) have recently proven effective for interpretability in large language models by transforming dense hidden states into sparse, semantically meaningful components. In this work, we extend this paradigm to Vision Transformers (ViTs), focusing on the [CLS] token, a compact representation that aggregates global image information but is difficult to analyze directly. By training an SAE on [CLS] tokens, we unfold this compressed signal into a sparse latent space that reveals consistent, class-specific activation patterns for in-distribution (ID) data and distinctive deviations for out-of-distribution (OOD) data. To make this structure explicit, we introduce Class Activation Profiles (CAPs), which rank SAE latent dimensions by their mean activation for each class, providing a class-conditioned reference that can be used for OOD detection. These observations show that ID samples not only concentrate activation in a small set of dominant features but also preserve a stable rank hierarchy, whereas OOD samples disrupt this structure. Leveraging this insight, we demonstrate that a simple Spearman rank correlation measure can effectively detect OOD data. This approach yields competitive AUROC scores and achieves a state-of-the-art FPR95 on one dataset while remaining highly competitive on the others. Notably, performance is stable across different OOD benchmarks, indicating robustness. These findings illustrate that the structural invariants revealed by SAEs can be turned into lightweight analytical tools, highlighting their value not only for detection but also for enhancing the transparency and interpretability of ViT feature representations.
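As a rough illustration of the CAP-based scoring the abstract describes, the minimal Python sketch below builds per-class mean-activation profiles from SAE latents and scores a test sample by its Spearman rank correlation against them. The function names (build_caps, ood_score), the choice of taking the best match over classes, and the sign convention are all illustrative assumptions; the abstract specifies only that a Spearman rank correlation measure over CAPs is used.

```python
import numpy as np
from scipy.stats import spearmanr

def build_caps(latents: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Class Activation Profiles: mean SAE latent activation per class.

    latents -- (N, D) SAE latent activations of [CLS] tokens from ID training data
    labels  -- (N,) integer class labels
    Returns a (num_classes, D) array; ranking each row gives that class's CAP.
    """
    return np.stack([latents[labels == c].mean(axis=0) for c in range(num_classes)])

def ood_score(sample_latent: np.ndarray, caps: np.ndarray) -> float:
    """Score one sample's (D,) latent vector against the CAPs.

    ID samples should preserve some class's rank hierarchy (high Spearman
    correlation with that class's profile), so we take the best match over
    classes and negate it: higher score = more OOD-like. Using the max over
    classes is an illustrative choice, not prescribed by the abstract.
    """
    rhos = [spearmanr(sample_latent, cap)[0] for cap in caps]
    return -max(rhos)
```

Thresholding ood_score then yields a detection rule, and AUROC/FPR95 can be computed directly over these scores.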
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11123