Abstract: Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their wide application, it is, however, not understood how these models compare their two inputs. Common first-order feature-attribution methods explain the importance of individual features and can, thus, provide only limited insight into dual encoders, whose predictions depend on interactions between features.
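To make the setting concrete, the following is a minimal sketch of a dual encoder in PyTorch: two independent encoders embed their respective inputs into a shared space, and the prediction is simply a similarity between the two embeddings. The DualEncoder class and the toy linear encoders are illustrative stand-ins, not the paper's code.

import torch
import torch.nn.functional as F

class DualEncoder(torch.nn.Module):
    """Two independent encoders feeding a shared embedding space."""

    def __init__(self, encoder_a: torch.nn.Module, encoder_b: torch.nn.Module):
        super().__init__()
        self.encoder_a = encoder_a  # e.g. an image tower, as in CLIP
        self.encoder_b = encoder_b  # e.g. a text tower, as in CLIP

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Each input is embedded independently; the two inputs only
        # interact through the similarity computed in the shared space.
        z_a = F.normalize(self.encoder_a(x_a), dim=-1)
        z_b = F.normalize(self.encoder_b(x_b), dim=-1)
        return (z_a * z_b).sum(dim=-1)  # cosine similarity per pair

# Toy usage with linear "encoders" projecting into a 64-dim shared space:
model = DualEncoder(torch.nn.Linear(512, 64), torch.nn.Linear(300, 64))
scores = model(torch.randn(2, 512), torch.randn(2, 300))
print(scores.shape)  # torch.Size([2])

Because the inputs meet only in this final similarity, per-feature (first-order) attributions cannot say which feature of one input matched which feature of the other, which is what motivates a second-order treatment.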
In this paper, we first derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This intrinsic visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects; we can identify individual errors as well as systematic failure categories. Code is publicly available: https://github.com/lucasmllr/exCLIP
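The paper's own derivation is given in the main text; purely as an illustration of what a second-order, interaction-level attribution can look like, the sketch below computes the mixed second derivatives of a differentiable similarity score with respect to paired input features. The function name interaction_gradients, the loop-based Hessian computation, and the toy linear encoders are hypothetical, not the authors' method or their exCLIP implementation.

import torch

def interaction_gradients(score_fn, x_a, x_b):
    # Mixed second derivatives d^2 s / (dx_a_i dx_b_j) of a scalar
    # similarity s = score_fn(x_a, x_b). The resulting matrix assigns
    # the prediction to pairs of features, one from each input.
    x_a = x_a.clone().requires_grad_(True)
    x_b = x_b.clone().requires_grad_(True)
    score = score_fn(x_a, x_b)
    (grad_a,) = torch.autograd.grad(score, x_a, create_graph=True)
    rows = []
    for g in grad_a.reshape(-1):
        # Differentiate each component of the first-order gradient
        # w.r.t. the second input to obtain one row of interactions.
        (row,) = torch.autograd.grad(g, x_b, retain_graph=True, allow_unused=True)
        rows.append(torch.zeros_like(x_b).reshape(-1) if row is None else row.reshape(-1))
    return torch.stack(rows).reshape(*x_a.shape, *x_b.shape)

# Toy usage: linear "encoders" and a dot-product similarity.
enc_a, enc_b = torch.nn.Linear(8, 4), torch.nn.Linear(6, 4)
similarity = lambda a, b: (enc_a(a) * enc_b(b)).sum()
attributions = interaction_gradients(similarity, torch.randn(8), torch.randn(6))
print(attributions.shape)  # torch.Size([8, 6])

Scores of this interaction type, once computed between caption tokens and image regions, are what allow statements such as the abstract's claim that matching objects across the two inputs can be localized; the generic computation above only conveys the idea of a second-order attribution.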
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear reviewers and editors,
we have made some final changes and provide the camera-ready version here.
Best,
the authors
Assigned Action Editor: ~Yonatan_Bisk1
Submission Number: 4377