Keywords: Vision-Language Models, Attribution Maps, Information Bottleneck Principle, Variational Inference
TL;DR: This paper proposes a multi-modal information bottleneck (M2IB) attribution method to improve the interpretability of vision-language models.
Abstract: Vision-language pretrained models have seen remarkable success, but their application to high-impact, safety-critical settings is limited by their lack of interpretability. To improve the interpretability of vision-language models, we propose a multi-modal information bottleneck (M2IB) objective that compresses irrelevant and noisy information while preserving relevant visual and textual features. We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models, increasing attribution accuracy and improving the interpretability of such models when applied to safety-critical domains such as medical diagnosis. Unlike commonly used unimodal attribution methods, M2IB does not require ground-truth labels, making it possible to audit representations of vision-language pretrained models when multiple modalities are available but ground-truth data is not. Using CLIP as an example, we demonstrate the effectiveness of M2IB attribution and show that it outperforms CAM-based attribution methods both qualitatively and quantitatively.