ViSAEBench: Cross-Backbone Evaluation of Vision Sparse Autoencoders Reveals Backbone-Dominated Variance and Metric Dissociations

Vijayrajsinh Gohil; Diwei Sheng; Chen Feng

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Venue: Mech Interp Workshop ICML 2026 Virtual poster, License: CC BY 4.0

Keywords: Interpretability tooling and software, Methods (probing, steering, causal interventions), Concept Discovery (e.g., SAEs, dictionary learning)

Other Keywords: sparse autoencoders, mechanistic interpretability, vision transformers, SAE evaluation, feature interpretability, cross-architecture analysis, Vision SAE, BatchTopK, JumpReLU, spatial coherence, Moran's I, monosemanticity, feature absorption, concept detection, ImageNet, DINOv2, CLIP, MAE, benchmark

TL;DR: We benchmark 60 SAEs across five ViT backbones and find that backbone explains over 60% of variance on five of seven metrics (over 90% on three), that MAE features are categorically undecomposable spatially, and that the field's default metrics (FVU, monosemanticity) do not track downstream performance.

Abstract: Sparse autoencoders (SAEs) are increasingly used to interpret Vision Transformer features, but unlike the language setting, there is no standardized protocol for comparing vision SAEs and no systematic characterization of how SAE quality depends on the pretrained backbone. We introduce ViSAEBench, a unified evaluation suite covering seven metrics across four interpretability dimensions, including a novel spatial coherence metric specific to vision. Using ViSAEBench, we conduct the first controlled cross-backbone study of vision SAEs: 60 SAEs trained on identical ImageNet-1K activations from five ViT-B backbones spanning four pretraining paradigms. Our central finding is that the choice of pretrained backbone shapes vision SAE behavior more than SAE hyperparameters do. A variance decomposition shows that backbone explains over 90\% of variance on three metrics and over 60\% on five of seven, while SAE hyperparameters dominate only reconstruction error. The starkest instance is categorical: across all configurations tested, SAEs trained on Masked Autoencoder features show no spatial structure beyond chance, while the other four backbones produce strongly spatially structured features. Single-backbone vision SAE evaluations are therefore often measuring properties of the backbone more than properties of the SAE. We further identify two metric-level dissociations with practical consequences. First, reconstruction error and downstream task preservation diverge substantially across backbones (Spearman $\rho=-0.70$), so reconstruction error alone cannot be used to compare vision SAEs. Second, monosemanticity, a central SAE quality criterion in language work, does not predict fine-grained classification, indicating that within-feature consistency does not capture the between-class separability that downstream tasks require. We release all 60 SAE checkpoints and the ViSAEBench evaluation library.
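
Illustration: the abstract does not specify how the variance decomposition is computed, so the following is only a minimal sketch of one common choice, the eta-squared ratio (between-group sum of squares over total sum of squares) for a single factor. The function name `variance_explained`, the backbone labels `A`/`B`, the hyperparameter labels `k1`/`k2`, and all score values are hypothetical toy inputs, not results from the paper or part of the ViSAEBench library.

```python
from collections import defaultdict


def variance_explained(scores, labels):
    """Eta squared: share of total variance in `scores` explained by grouping on `labels`.

    Assumption: a one-factor sum-of-squares ratio; the paper's exact
    decomposition may differ.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal in length")

    grand_mean = sum(scores) / len(scores)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    if ss_total == 0:
        return 0.0  # no variance to explain

    # Collect the scores belonging to each level of the factor.
    groups = defaultdict(list)
    for score, label in zip(scores, labels):
        groups[label].append(score)

    # Between-group sum of squares, weighted by group size.
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total


# Hypothetical toy metric values for 2 backbones x 2 SAE settings.
scores = [0.10, 0.12, 0.80, 0.82]
backbones = ["A", "A", "B", "B"]
hyperparams = ["k1", "k2", "k1", "k2"]

backbone_share = variance_explained(scores, backbones)
hyperparam_share = variance_explained(scores, hyperparams)
print(f"backbone: {backbone_share:.4f}, hyperparameters: {hyperparam_share:.4f}")
```

In this toy grid the scores differ far more between backbones than between SAE settings, so nearly all variance is attributed to the backbone factor. Because the design is balanced and has no interaction, the two shares sum to one; with real benchmark data an interaction or residual term would generally take up part of the total.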

Submission Number: 250
