SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

Published: 26 Sept 2023 · Last Modified: 16 Jan 2024 · NeurIPS 2023 Datasets and Benchmarks Poster
Keywords: vision-language models, contrastive training, CLIP, compositional understanding, compositionality, dataset artifacts
TL;DR: We show that existing benchmarks for vision-language compositionality are hackable. We present SugarCrepe, a new benchmark that remedies this vulnerability to faithfully evaluate a vision-language model's compositionality.
Abstract: In the last year alone, a surge of new benchmarks to measure $\textit{compositional}$ understanding of vision-language models has permeated the machine learning ecosystem. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. Surprisingly, we find significant biases in $\textit{all}$ these benchmarks, rendering them hackable. This hackability is so dire that blind models with no access to the image outperform state-of-the-art vision-language models. To remedy this rampant vulnerability, we introduce $\textit{SugarCrepe}$, a new benchmark for vision-language compositionality evaluation. We employ large language models, instead of the rule-based templates used in previous benchmarks, to generate fluent and sensical hard negatives, and utilize an adversarial refinement mechanism to maximally reduce biases. We re-evaluate state-of-the-art models and recently proposed compositionality-inducing strategies, and find that their improvements were hugely overestimated, suggesting that more innovation is needed in this important direction. We release $\textit{SugarCrepe}$ and the code for evaluation at: https://github.com/RAIVNLab/sugar-crepe.
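To make the evaluation protocol described in the abstract concrete, below is a minimal sketch of how an image-to-text compositionality test of this kind is typically scored with CLIP: the model is given the image, the true caption, and a hard-negative distractor, and it passes the example only if the true caption receives the higher image-text similarity. This sketch assumes the Hugging Face `transformers` CLIP API; the file name and captions are illustrative placeholders, not items from the released dataset, and it is not the authors' official evaluation script (see the GitHub repository for that).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a standard pretrained CLIP checkpoint (placeholder choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def passes_example(image: Image.Image, positive: str, hard_negative: str) -> bool:
    """Return True if the image-text similarity of the true caption
    exceeds that of the compositional hard negative."""
    inputs = processor(
        text=[positive, hard_negative],
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        # logits_per_image has shape (num_images, num_texts) = (1, 2)
        scores = model(**inputs).logits_per_image[0]
    return bool(scores[0] > scores[1])

# Hypothetical example: captions differ only in how objects/attributes compose.
image = Image.open("example.jpg")  # placeholder image path
print(passes_example(
    image,
    positive="A dog chases a cat across the lawn.",
    hard_negative="A cat chases a dog across the lawn.",
))
```

Benchmark accuracy is then the fraction of test examples for which the true caption wins; the paper's "blind model" critique is that on earlier benchmarks this comparison can often be won by scoring the captions alone, without ever looking at the image.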
Supplementary Material: pdf
Submission Number: 688