DRef: A Benchmark with Diverse Referring Expressions for Object Comprehension of Vision-Language Models
Keywords: Referring Expression Comprehension; Visual Grounding
TL;DR: We propose DRef, a benchmark, and two new metrics (Hard Pass Rate and Consistency) for evaluating the robustness and consistency of vision-language models when presented with multiple referring expressions of the same object.
Abstract: Referring expression comprehension (REC) tasks challenge vision-language models (VLMs) to locate specific objects within images based on natural-language descriptions, typically by generating bounding boxes or segmentation masks. Existing REC benchmarks suffer from fundamental shortcomings: (1) their limited diversity of referring expressions per object makes it impossible to distinguish whether VLMs truly understand object semantics or simply memorize specific associations; (2) their evaluation metrics do not reveal whether a VLM is robust to complex and diverse referring expressions. We address these issues with a novel benchmark and two new metrics. Our benchmark, \textbf{D}iverse \textbf{Ref}erring Expressions for Object Comprehension (DRef), comprises 10,963 carefully crafted referring expressions for 824 objects spanning 187 categories. Each referred object has an average of 8.3 distinct positive expressions alongside 5.0 negative expressions describing non-existent objects. To evaluate model robustness to expression diversity, we propose two complementary metrics: (1) Hard Pass Rate, which requires successful localization across all expressions referring to the same object; and (2) Consistency, which quantifies how consistently a VLM's outputs agree across expressions describing the same object. Our evaluation reveals that state-of-the-art models struggle with consistent object comprehension: the best model in our assessment, Qwen2.5-VL-72B, attains only 27.7\% Hard Pass Rate and identifies all negative expressions for only 10.1\% of images. DRef can serve as a rigorous evaluation suite for assessing the robustness of REC models under diverse expressions, and we hope it encourages efforts toward more reliable REC systems in real-world applications such as robotics.
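To make the two metrics concrete, the sketch below shows one plausible way they could be computed for bounding-box outputs. The IoU threshold of 0.5, the use of mean pairwise IoU for Consistency, and the function names `hard_pass` and `consistency` are illustrative assumptions, not the paper's exact definitions.

```python
# Hypothetical sketch of the two metrics described in the abstract.
# Assumptions: boxes are (x1, y1, x2, y2), pass threshold is IoU >= 0.5,
# and Consistency is approximated by mean pairwise IoU of predictions.
from itertools import combinations

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hard_pass(pred_boxes, gt_box, thresh=0.5):
    """Object passes only if every expression's prediction localizes it correctly."""
    return all(iou(p, gt_box) >= thresh for p in pred_boxes)

def consistency(pred_boxes):
    """Mean pairwise IoU between predictions for the same object (one possible definition)."""
    pairs = list(combinations(pred_boxes, 2))
    if not pairs:
        return 1.0
    return sum(iou(a, b) for a, b in pairs) / len(pairs)
```

Under these assumptions, Hard Pass Rate would be the fraction of objects for which `hard_pass` returns True, and Consistency the average of `consistency` over objects.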
Primary Area: datasets and benchmarks
Submission Number: 13448