Referring Expression Comprehension Under Robust, Knowledge-Aware, and Preference-Optimization: A Dataset-Centric Review

Juexi Shao, Yujian Gan, Siyou Li, Massimo Poesio

Published: 15 Dec 2025, Last Modified: 16 Mar 2026 | Crossref | CC BY-SA 4.0

Abstract: Bridging linguistic expressions to real-world entities is a fundamental capability for artificial intelligence to understand and interact with its environment. Referring Expression Comprehension (REC) aims to localize target objects in visual scenes based on natural language descriptions and has evolved rapidly alongside advances in multimodal learning. However, the development of REC datasets is increasingly shaped by the heavy reuse of shared visual sources, leading to an apparent diversity that often masks structural limitations in generalization and evaluation. In this survey, we present a comprehensive and up-to-date analysis of recent progress in REC, covering approximately fifty datasets across images, videos, and 3D/RGB-D scan scenes, together with a comparative review of representative models and evaluation protocols. Moreover, we systematically examine emerging research threads, including scene-knowledge-aware grounding, robustness-oriented grounding, and preference-based optimization. Finally, we analyze the practical limitations of current REC research and outline promising future directions toward more robust, scalable, and realistic grounding systems.

External IDs: doi:10.36227/techrxiv.176581153.37368696/v1