Tracing Concept Circuits to Audit and Steer Vision Transformers

01 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Interpretable Machine Learning, Representation Learning, Sparse Autoencoders
Abstract: Advanced vision models such as Vision Transformers (ViTs) may base their decisions on spurious cues, even when their predictions are correct. To deploy them safely in high-stakes applications, it is essential to audit ViT decision-making and steer the models away from unsafe predictions. Traditional interpretation methods typically attribute predictions to salient pixels or neurons. However, such simplified correlations often overlook the concepts encoded in internal representations, which can be the true causes of failures. To address this, we develop an interpretation toolbox, ViSAE, that traces concept circuits in ViT representations. These circuits enable users to (i) audit models by identifying spurious shortcuts and (ii) steer model behavior by amplifying or suppressing specific concepts along influential paths. Specifically, we construct a neuroscience-motivated probing suite (63K images and 16K concepts) that mirrors the hierarchy of the human visual cortex. Building on this data, we train Sparse Autoencoders (SAEs) to read concepts directly from ViT representations and trace their causal relationships. Extensive experiments and ablation studies show that our probing suite outperforms existing counterparts by 20× in concept coverage efficiency and by 28.7% in interpretation accuracy. Using ViSAE, we identify spurious decision paths, localize concepts on pixels, and diagnose model failure modes. Furthermore, our toolbox enables model steering by editing concepts within representations, which improves worst-group accuracy on the WaterBirds dataset by 48.2%.
Primary Area: interpretability and explainable AI
Submission Number: 636
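
Illustrative sketch (not the authors' code): the abstract describes reading concepts from ViT representations with SAEs and steering by amplifying or suppressing concepts. One common way to realize that idea is shown below with NumPy. Everything here is an assumption made for illustration: the function names (`sae_encode`, `sae_decode`, `steer`), the ReLU encoder, the random weights, and the dimensions `d` and `k` are hypothetical and do not come from ViSAE.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map a representation x of size d to non-negative concept activations of size k."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Map concept activations z of size k back to representation space of size d."""
    return z @ W_dec + b_dec

def steer(x, concept_idx, scale, W_enc, b_enc, W_dec, b_dec):
    """Rescale one concept activation and return the edited representation.

    scale > 1 amplifies the concept, scale = 0 suppresses it, scale = 1 is a no-op.
    The SAE reconstruction error is added back so that only the chosen
    concept direction changes.
    """
    z = sae_encode(x, W_enc, b_enc)
    residual = x - sae_decode(z, W_dec, b_dec)  # part of x the SAE does not explain
    z_edit = z.copy()
    z_edit[..., concept_idx] *= scale
    return sae_decode(z_edit, W_dec, b_dec) + residual

# Toy setup with random (untrained) weights; sizes are hypothetical.
rng = np.random.default_rng(0)
d, k = 8, 16
W_enc = rng.normal(size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(size=(k, d))
b_dec = np.zeros(d)

x = rng.normal(size=d)                 # stands in for one ViT token representation
z = sae_encode(x, W_enc, b_enc)        # concept activations read from x
x_suppressed = steer(x, 3, 0.0, W_enc, b_enc, W_dec, b_dec)
x_unchanged = steer(x, 3, 1.0, W_enc, b_enc, W_dec, b_dec)
```

In this sketch, suppressing concept 3 subtracts exactly `z[3] * W_dec[3]` from `x`, and a scale of 1 leaves `x` unchanged. A trained SAE would additionally use a sparsity penalty during training, which is omitted here.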