CLIP Models Generalize Less Than Compositional Benchmarks Suggest

Published: 25 May 2026, Last Modified: 25 May 2026
Venue: CTB@ICML 2026
Readers: Everyone
License: CC BY 4.0
Keywords: compositional generalization, familiarity shortcut, benchmark evaluation, evaluation protocol, vision-language models
TL;DR: Compositional benchmark scores conflate generalization with memorization of training-set (attribute, object) bindings; our BindSplit protocol disentangles them and reorders existing leaderboards.
Abstract: Compositional benchmarks track progress on CLIP-based compositional reasoning. Each new method reports higher scores than the last, but it is unclear whether the improvements reflect generalization to novel bindings or memorization of bindings already seen during training. To find out, we run two analyses: a synthetic ground-truth study with curated fully-seen, partially-unseen, and fully-unseen binding splits; and an extension to three real compositional benchmarks (ARO VG-A, BiVLC, VisMin) using binding overlap with COCO as a proxy for the alignment-training distribution. On the synthetic dataset, accuracy drops monotonically from fully-seen to fully-unseen across nine CLIP backbones. On ARO VG-A, positive captions overlap COCO bindings nearly twice as often as their attribute-swapped negatives ($79.8\%$ vs. $41.8\%$); only $1.2\%$ of samples have no COCO-overlapping bindings. Restricting evaluation to the splits where this asymmetry vanishes by construction (those where all or none of the bindings overlap COCO) reorders the leaderboard. The accuracy drop from seen to unseen bindings broadly replicates on BiVLC and VisMin, though with greater noise. Compositional benchmarks should report performance on these shortcut-free splits; otherwise reported improvements likely overstate how much CLIP has learned to bind.
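The split assignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' BindSplit implementation. It assumes each benchmark sample has already been reduced to a set of (attribute, object) bindings pooled over its positive and negative captions, and that the COCO bindings are available as a plain set. The names `assign_split`, `is_shortcut_free`, and the split labels are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: a binding is an (attribute, object) pair.
Binding = Tuple[str, str]


def assign_split(sample_bindings: Iterable[Binding], seen: Set[Binding]) -> str:
    """Label a sample by how many of its bindings occur in the reference set.

    `seen` stands in for the set of bindings extracted from COCO captions
    (an assumption; how bindings are extracted is not specified here).
    """
    bindings = set(sample_bindings)
    if not bindings:
        raise ValueError("sample has no bindings")
    n_seen = len(bindings & seen)
    if n_seen == len(bindings):
        return "fully_seen"
    if n_seen == 0:
        return "fully_unseen"
    return "partially_unseen"


def is_shortcut_free(split: str) -> bool:
    """All-or-none overlap: neither caption is more familiar than the other."""
    return split in {"fully_seen", "fully_unseen"}


# Toy reference set standing in for COCO-derived bindings.
coco_bindings = {("red", "car"), ("blue", "bus"), ("blue", "car"), ("red", "bus")}

# Positive caption: "red car, blue bus"; attribute-swapped negative: "blue car, red bus".
sample_a = [("red", "car"), ("blue", "bus"), ("blue", "car"), ("red", "bus")]
# Positive binding is familiar, swapped negative is not.
sample_b = [("red", "car"), ("purple", "car")]
# Nothing overlaps the reference set.
sample_c = [("purple", "zebra"), ("striped", "sky")]

splits = [assign_split(s, coco_bindings) for s in (sample_a, sample_b, sample_c)]
```

Under these assumptions, only `sample_b` lands in the partially-unseen split, where a model could prefer the positive caption purely because its binding is familiar; the other two samples fall in the splits the abstract calls shortcut-free.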
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 117