Keywords: ViT, decomposition, mechanistic interpretability
TL;DR: We show how to decompose and interpret representations when your ViT != CLIP
Abstract: Recent works have explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. Components such as attention heads and MLPs have been found to capture distinct image features such as shape, color, or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. Thus, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. We also introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViTs (e.g., DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval, visualizing token importance heatmaps, and mitigating spurious correlations.
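The two steps in the abstract, decomposing the final representation into per-component contributions and linearly mapping them to CLIP space for text-based scoring, can be illustrated with a minimal sketch. All shapes, the random placeholder data, and the cosine-based score below are illustrative assumptions, not the paper's actual implementation; in practice the linear map would be fit against real CLIP embeddings and the contributions extracted from a trained ViT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a ViT whose final representation is the sum of
# per-component contributions (e.g., attention heads and MLPs).
n_components, d_vit, d_clip = 8, 64, 32

# (a) Decomposition: contributions c_i whose sum equals the final representation.
contributions = rng.standard_normal((n_components, d_vit))
final_repr = contributions.sum(axis=0)

# (b) A linear map W: R^{d_vit} -> R^{d_clip}. Here W is random; in the
# framework it would be fit so W @ final_repr aligns with CLIP image embeddings.
W = rng.standard_normal((d_clip, d_vit))

# Map each component's contribution into CLIP space for text comparison.
mapped = contributions @ W.T  # shape: (n_components, d_clip)

# Scoring: rank components by similarity to a CLIP text feature
# (a placeholder vector standing in for an encoded prompt like "texture").
text_feature = rng.standard_normal(d_clip)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

scores = np.array([cosine(m, text_feature) for m in mapped])
ranking = np.argsort(-scores)  # most feature-relevant components first
```

Linearity is what makes this work: because the final representation is a sum of contributions and the map to CLIP space is linear, each component's share of the CLIP-space representation is well defined and can be scored against text independently.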
Submission Number: 48