Worse Together: Understanding the Brittleness of Multimodal Models on Rare Concept Pairs

ICLR 2026 Conference Submission23636 Authors

20 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multimodal models, robustness, CLIP, vision-language models, distribution shifts
Abstract: Multimodal models are being deployed in real-world settings where rare or unseen combinations of objects during pretraining are bound to appear at test time. Understanding how these models generalize to rare combinations of concepts is thus an important robustness problem. In this paper, we investigate how the pairwise co-occurrence of concepts in the pretraining dataset impacts CLIP and large multimodal model (LMM) performance on uncommon concept pairs. We measure concept co-occurrence with pointwise mutual information (PMI), which corrects for the correlation between single and paired concept frequencies. We show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M ($r=0.97$ and 14\% accuracy gap between images in the top and bottom 5\% of PMI values), and demonstrate that a simple PMI-based image edit can induce an accuracy drop of up to 10\% on images edited to contain low PMI pairs. We additionally find that this behavior in CLIP transfers to LMMs built on top of CLIP ($r=0.70$ for TextVQA, $r=0.62$ for VQAv2). Finally, we demonstrate that fine-tuning CLIP with augmented data covering a broad range of PMI values is a promising strategy to improve robustness on rare concept pairs.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 23636