ViSAEBench: Cross-Backbone Evaluation of Vision Sparse Autoencoders Reveals Backbone-Dominated Variance and Metric Dissociations
Keywords: Interpretability tooling and software, Methods (probing, steering, causal interventions), Concept Discovery (e.g., SAEs, dictionary learning)
Other Keywords: sparse autoencoders, mechanistic interpretability, vision transformers, SAE evaluation, feature interpretability, cross-architecture analysis, Vision SAE, BatchTopK, JumpReLU, spatial coherence, Moran's I, monosemanticity, feature absorption, concept detection, ImageNet, DINOv2, CLIP, MAE, benchmark
TL;DR: We benchmark 60 SAEs across five ViT backbones and find that backbone explains over 60% of variance on five of seven metrics (over 90% on three), that SAEs trained on MAE features show no spatial structure beyond chance, and that the field's default metrics (FVU, monosemanticity) do not track downstream performance.
Abstract: Sparse autoencoders (SAEs) are increasingly used to interpret Vision Transformer
features, but unlike the language setting, there is no standardized protocol for
comparing vision SAEs and no systematic characterization of how SAE quality
depends on the pretrained backbone. We introduce ViSAEBench, a unified
evaluation suite covering seven metrics across four interpretability dimensions,
including a novel spatial coherence metric specific to vision. Using ViSAEBench,
we conduct the first controlled cross-backbone study of vision SAEs: 60
SAEs trained on identical ImageNet-1K activations from five ViT-B backbones
spanning four pretraining paradigms. Our central finding is that the choice of
pretrained backbone dominates vision SAE behavior more than SAE hyperparameters.
A variance decomposition shows that backbone explains over 90\% of variance on
three metrics and over 60\% on five of seven, while SAE hyperparameters dominate
only reconstruction error. The starkest instance is categorical: across all
configurations tested, SAEs trained on Masked Autoencoder features show no
spatial structure beyond chance, while the other four backbones produce strongly
spatially structured features. Single-backbone vision SAE evaluations are
therefore often measuring properties of the backbone more than properties of the
SAE. We further identify two metric-level dissociations with practical
consequences. First, reconstruction error and downstream task preservation substantially diverge across backbones (Spearman $\rho=-0.70$), so reconstruction error alone cannot be used to compare vision SAEs. Second, monosemanticity, a
central SAE quality criterion in language work, does not predict fine-grained
classification, indicating that within-feature consistency does not capture the
between-class separability downstream tasks require. We release all 60 SAE
checkpoints and the ViSAEBench evaluation library.
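The variance decomposition mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which estimator the paper uses, so this assumes a one-way eta-squared (between-group sum of squares over total sum of squares). The function name `variance_explained_by_factor`, the equal split of 12 SAE configurations per backbone, and the synthetic scores are all hypothetical and are not taken from the ViSAEBench library or its results.

```python
import numpy as np


def variance_explained_by_factor(values, labels):
    """Eta-squared: share of the total variance in `values` explained by `labels`.

    Hypothetical helper for illustration; not part of ViSAEBench.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0.0:
        # A constant metric has no variance to explain.
        return 0.0
    ss_between = 0.0
    for group_label in np.unique(labels):
        group = values[labels == group_label]
        ss_between += group.size * (group.mean() - grand_mean) ** 2
    return ss_between / ss_total


# Synthetic stand-in for one metric over 60 SAEs (assumed: 5 backbones x 12 configs).
rng = np.random.default_rng(0)
backbones = np.repeat(np.arange(5), 12)
configs = np.tile(np.arange(12), 5)

# Assumed data: a large per-backbone offset plus small per-SAE noise.
backbone_offset = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
scores = backbone_offset[backbones] + 0.1 * rng.standard_normal(60)

eta_backbone = variance_explained_by_factor(scores, backbones)
eta_config = variance_explained_by_factor(scores, configs)
print(f"backbone share: {eta_backbone:.3f}, config share: {eta_config:.3f}")
```

On data built this way the backbone share is close to 1 and the configuration share is close to 0, which is the pattern the abstract reports for most metrics. A metric dominated by SAE hyperparameters, such as reconstruction error, would show the reverse.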
Submission Number: 250