Abstract: Dual encoder architectures such as CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their wide application, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods explain the importance of individual features and can therefore provide only limited insight into dual encoders, whose predictions depend on interactions between features.
In this paper, we first derive a second-order method that enables attributing the predictions of any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images: they match objects across input modalities and also account for mismatches. However, this intrinsic visual-linguistic grounding ability varies heavily between object classes and exhibits pronounced out-of-domain effects; we identify individual errors as well as systematic failure categories.
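To make the idea concrete, the following is a minimal, illustrative sketch of one way second-order attributions onto feature pairs can be approximated for a differentiable similarity function. It is not the paper's Algorithm 1; the names `sim_fn`, `x_ref`, `y_ref`, and `n_steps` are hypothetical. In the spirit of integrated gradients, it integrates the mixed second derivative of the similarity score along a single straight path from a pair of reference inputs to the actual inputs:

```python
import torch

def interaction_attributions(sim_fn, x, y, x_ref, y_ref, n_steps=50):
    """Attribute sim_fn(x, y) onto feature pairs (i, j) by integrating the
    mixed second derivative d^2 sim / (dx_i dy_j) along a straight path
    from the references (x_ref, y_ref) to the inputs (x, y), approximated
    with an n_steps-point Riemann sum."""
    attr = torch.zeros(x.numel(), y.numel())
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        # interpolate both inputs along the same linear path
        xa = (x_ref + alpha * (x - x_ref)).detach().requires_grad_(True)
        ya = (y_ref + alpha * (y - y_ref)).detach().requires_grad_(True)
        score = sim_fn(xa, ya)
        # first-order gradient w.r.t. x; keep the graph for the second pass
        gx, = torch.autograd.grad(score, xa, create_graph=True)
        gx = gx.flatten()
        # one row of the mixed Hessian per x-feature
        for i in range(gx.numel()):
            hij, = torch.autograd.grad(gx[i], ya, retain_graph=True)
            attr[i] += hij.flatten() / n_steps
    # weight by the path increments, analogous to integrated gradients
    return attr * (x - x_ref).flatten()[:, None] * (y - y_ref).flatten()[None, :]

# toy dual encoder: dot-product similarity of two linear embeddings
f = torch.nn.Linear(8, 4)
g = torch.nn.Linear(6, 4)
sim = lambda a, b: (f(a) * g(b)).sum()
A = interaction_attributions(sim, torch.randn(8), torch.randn(6),
                             torch.zeros(8), torch.zeros(6))
print(A.shape)  # torch.Size([8, 6]) -- one score per feature pair
```

A naive mixed-Hessian loop like this scales with the product of the two input dimensions; practical implementations batch the Hessian-vector products.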
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear reviewers and editors,
We have uploaded a revised version of our work that addresses the suggestions and requests raised in the reviews. Specifically:
- We have added an experiment in Appendix C evaluating our method’s approximation error as a function of the number of integration steps $N$.
- This experiment also includes different reference inputs for both the image and the text encoder and evaluates the differences in the resulting attributions.
- We have added a discussion of computational complexity in Appendix I, including pseudocode in Algorithm 1.
- We have improved the structure of Figure 1. Figure 22 in the Appendix also shows an alternative illustration, and we would appreciate feedback on which version is more intuitive and understandable.
- We have changed parts of the introduction, methods, and discussion sections to elaborate on the relation to integrated gradients and the linearity of the integration path, and to improve the intuitive understanding of our method.
- We have added the suggested related work.
Assigned Action Editor: ~Yonatan_Bisk1
Submission Number: 4377