Steering CLIP's vision transformer with sparse autoencoders

Published: 31 Mar 2025, Last Modified: 21 Apr 2025 · MIV at CVPR 2025 (Non-proceedings Track) · Poster · CC BY 4.0
Keywords: mechanistic interpretability, multimodal mechanistic interpretability, sparse autoencoders, interpretability, CLIP, vision transformers
TL;DR: We train sparse autoencoders on CLIP's vision transformer to understand and steer its internal features, achieving state-of-the-art performance in defending against typographic attacks while improving disentanglement on CelebA and Waterbirds.
Abstract: While vision models are highly capable, their internal mechanisms remain poorly understood, a challenge that sparse autoencoders (SAEs) have helped address in language but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer, uncovering key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis of the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers and achieving state-of-the-art defense against typographic attacks. We release our CLIP SAE models and code to support future research in vision transformer interpretability.
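To make the "targeted suppression of SAE features" concrete, below is a minimal sketch, not the authors' released code, of how one might zero out a single SAE feature in a CLIP ViT residual stream via a forward hook. The `SparseAutoencoder` class, the layer index, the feature index, and the randomly initialized SAE weights are all hypothetical placeholders; in practice one would load a trained SAE and pick a feature identified as encoding the unwanted concept (e.g. rendered text for typographic attacks).

```python
# Minimal sketch (assumptions, not the paper's implementation) of suppressing
# one SAE feature in a CLIP vision transformer layer at inference time.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class SparseAutoencoder(nn.Module):
    """Standard ReLU SAE: x ~= decode(relu((x - b_dec) @ W_enc + b_enc))."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

d_model, d_sae = 1024, 16384                 # ViT-L/14 hidden size; expansion factor is an assumption
sae = SparseAutoencoder(d_model, d_sae)      # placeholder: load trained SAE weights here
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

LAYER = 11              # a middle layer, where the paper reports the best disentanglement
SUPPRESS_FEATURE = 42   # hypothetical index of the feature to suppress

def suppress_hook(module, inputs, output):
    hidden = output[0]                        # (batch, tokens, d_model) activations
    feats = sae.encode(hidden)
    error = hidden - sae.decode(feats)        # keep the SAE reconstruction error term
    feats[..., SUPPRESS_FEATURE] = 0.0        # targeted suppression of one SAE feature
    edited = sae.decode(feats) + error        # edited activations replace the originals
    return (edited,) + output[1:]

handle = clip.vision_model.encoder.layers[LAYER].register_forward_hook(suppress_hook)
with torch.no_grad():
    pixel_values = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
    edited_embedding = clip(pixel_values=pixel_values).pooler_output
handle.remove()
```

Adding back the reconstruction error term means the intervention changes only the component of the activation that the suppressed feature explains, leaving unrelated information intact.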
Public: Yes
Submission Number: 15