Keywords: automatic interpretation, mechanistic interpretability, diffusion model, sparse autoencoder
TL;DR: This paper systematically studies how to automatically generate accurate text labels for visual concepts discovered by sparse autoencoders.
Abstract: Recent progress in mechanistic interpretability and sparse autoencoders (SAEs) has opened new avenues for understanding vision models, yet automatically assigning accurate textual descriptions to discovered concepts remains unprincipled.
Existing studies rely on proxy metrics such as CLIP similarity or on qualitative inspection, neither of which measures the semantic faithfulness of the concept descriptions.
To bridge this gap, we conduct a principled study of the automatic interpretation pipeline, evaluating key design choices including MLLM query construction and sample selection.
We bring Semantic Label Quality (SLQ) metrics from language model interpretability to vision, providing a direct measure of label faithfulness.
We also investigate whether synthetic counterfactuals generated by a conditional generative model can further improve interpretation.
Experiments on synthetic faces, histopathology, and remote sensing images reveal that optimal interpretation strategies are dataset-dependent: no single configuration universally outperforms others.
Counterfactual contrastive samples improve interpretation for localized, additive concepts but provide limited benefit for global concepts where counterfactuals are less well defined.
Submission Number: 8