Abstract: An image can be described by multiple captions whose semantics differ widely. Depending on a caption's semantics, it can be more or less challenging to align to an image. While contrastive learning, the dominant paradigm for aligning images and captions, may favor easy captions that overlap semantically with the image, it is unclear how well contrastively trained vision-language models (VLMs) scale to harder captions. In this work we introduce a dataset with diverse image captions to benchmark a wide range of VLMs across caption difficulty levels. Our findings show that existing VLMs struggle with caption diversity and scale poorly to challenging captions.
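As a rough illustration of the setting the abstract describes, the sketch below scores one image against captions of varying semantic difficulty with an off-the-shelf contrastively trained VLM (CLIP via Hugging Face transformers). The model choice, image path, and example captions are assumptions for illustration only and are not the paper's benchmark or method.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf contrastively trained VLM (assumed checkpoint).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = [
    "a dog on a beach",                             # "easy": direct visual overlap
    "an animal resting after a long day outdoors",  # "harder": more abstract semantics
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-caption similarity logits; a lower score for the more abstract caption
# would illustrate the scaling issue described in the abstract.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs.squeeze(0).tolist())))
```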
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: Zhe Gan
Submission Number: 4074