Keywords: MLLM, Referring Expression Comprehension
Abstract: Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have seen rapid progress with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains 5k expressions on real images (1k human-authored, 4k human-verified), curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on it. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16232