Abstract: In this study, we propose an enhancement to the similarity computation mechanism in multimodal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metrics between paired modalities should correspond to the pointwise mutual information (PMI) between the two modalities. However, the current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that our approach overall outperforms the standard CLIP formulation across several retrieval and classification tasks.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Uri_Shaham1
Submission Number: 9210
Loading