Image Polysemy in Contrastive Vision-Language Learning

TMLR Paper 4074 Authors

28 Jan 2025 (modified: 28 Apr 2025) · Rejected by TMLR · CC BY 4.0
Abstract: An image can be described by multiple captions whose semantic content may differ substantially. Depending on its semantics, a caption may be more or less challenging to align with the image. While contrastive learning, the dominant paradigm for aligning images and captions, may favor easy captions that semantically overlap with the image, it is unclear how well contrastively trained vision-language models (VLMs) scale to harder captions. In this work, we introduce a dataset with diverse image captions to benchmark a wide range of VLMs across caption difficulty levels. Our findings show that existing VLMs struggle with caption diversity and scale poorly to challenging captions.
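
For context, the contrastive objective the abstract refers to is the standard CLIP-style symmetric InfoNCE loss, which pulls matched image-caption pairs together and pushes mismatched pairs apart within a batch. The PyTorch sketch below is a minimal illustration of that generic objective, not the paper's method; the function name and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style contrastive training.

    image_embeds, text_embeds: (batch, dim) tensors where matched
    image-caption pairs share the same row index.
    """
    # L2-normalize so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logits[i, j] is the scaled similarity of image i to caption j.
    logits = image_embeds @ text_embeds.t() / temperature

    # Positives lie on the diagonal: caption i describes image i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because each caption competes only against the other captions in the batch, the objective can be satisfied by surface-level semantic overlap, which is the behavior the paper's benchmark is designed to probe.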
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: Zhe Gan
Submission Number: 4074