Keywords: Applications of interpretability
Other Keywords: Vision-Language Prompt Learning
TL;DR: We propose using SAE-based interpretability tools as training guides for vision-language prompt learning, showing that interpretable concept directions can move beyond post-hoc analysis and actively inform model adaptation to downstream tasks.
Abstract: Recent work on mechanistic interpretability of vision-language models (VLMs) such as CLIP uses sparse autoencoders (SAEs) to discover monosemantic, human-understandable features that explain CLIP’s internal representations. Existing work that probes VLMs with SAEs focuses primarily on post-hoc interpretability analysis. We posit that SAE-based interpretability methods are not just probing tools, but can also serve as meaningful training guides for adapting VLMs to downstream tasks. To this end, we propose IPL (Interpretability-Guided Prompt Learning), which leverages SAE decoders to extract interpretable concept directions, composes them into prompt tokens via a learnable attention selector, and injects the resulting tokens into both the vision and text encoder layers of CLIP for adaptation. We further study how prompt tokens built from vision-only, text-only, and unified concept directions, each obtained from the corresponding interpretability method, affect performance on downstream tasks. We perform extensive experiments across four downstream settings: base-to-novel generalization, domain generalization, cross-dataset transfer, and few-shot learning. While IPL with vision-only or text-only concept directions obtains decent gains, IPL with unified concept directions achieves the strongest results, outperforming most prior prompt-learning methods on 15 datasets across all downstream settings.
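The pipeline described in the abstract (concept directions from an SAE decoder, an attention selector that mixes them into prompt tokens, and injection into an encoder layer) can be illustrated with a minimal NumPy sketch. This is not the IPL implementation: the function names `compose_prompt_tokens` and `inject_prompts`, the single-head dot-product form of the selector, the unit normalization, the append-to-sequence injection, and all shapes are assumptions made for illustration, and the random arrays stand in for real SAE decoder weights and learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compose_prompt_tokens(concept_dirs, queries, key_proj):
    """Mix concept directions into prompt tokens with dot-product attention.

    concept_dirs: (n_concepts, d_model) rows of an SAE decoder (assumed layout)
    queries:      (n_prompts, d_k) one query per prompt token (hypothetical, learnable)
    key_proj:     (d_model, d_k) key projection (hypothetical, learnable)
    """
    # Unit-normalize each direction so selection depends on direction, not scale.
    dirs = concept_dirs / np.linalg.norm(concept_dirs, axis=1, keepdims=True)
    keys = dirs @ key_proj                               # (n_concepts, d_k)
    scores = queries @ keys.T / np.sqrt(keys.shape[1])   # (n_prompts, n_concepts)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    # Each prompt token is a convex combination of concept directions.
    return weights @ dirs, weights


def inject_prompts(layer_tokens, prompt_tokens):
    """Append prompt tokens to one encoder layer's token sequence (assumed scheme)."""
    return np.concatenate([layer_tokens, prompt_tokens], axis=0)


rng = np.random.default_rng(0)
d_model, n_concepts, n_prompts, d_k = 16, 32, 4, 8

sae_decoder = rng.normal(size=(n_concepts, d_model))  # stand-in for SAE decoder weights
queries = rng.normal(size=(n_prompts, d_k))           # stand-in for learned queries
key_proj = rng.normal(size=(d_model, d_k))            # stand-in for learned projection

prompts, weights = compose_prompt_tokens(sae_decoder, queries, key_proj)
tokens = rng.normal(size=(10, d_model))               # stand-in for a layer's tokens
augmented = inject_prompts(tokens, prompts)
print(prompts.shape, augmented.shape)
```

In this sketch only `queries` and `key_proj` would be trained, while the concept directions stay fixed, so each prompt token remains a weighted mixture of interpretable directions and the attention weights indicate which concepts a given prompt token draws on.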
Submission Number: 428