Keywords: referring expression comprehension, large language models
TL;DR: Language Models can do Zero-Shot Visual Referring Expression Comprehension
Abstract: The use of visual referring expressions is an important aspect of human-robot interaction. Comprehending referring expressions (ReC) like “the brown cookie near the cup” requires understanding both self-referential expressions, “brown cookie”, and relational referential expressions, “near the cup”. Large pretrained vision-language models like CLIP excel at handling self-referential expressions but struggle with the latter. In this work, we reframe ReC as a language reasoning task and explore whether it can be addressed by large pretrained language models (LLMs), including GPT-3.5 and GPT-4. Given textual attributes (object category, color, center location, size), GPT-3.5 reasons about spatial relationships unstably even with heavy prompt engineering, while GPT-4 shows strong and stable zero-shot relational reasoning. Evaluation on the RefCOCO/g datasets and in interactive robot-grasping scenarios shows that LLMs can perform ReC with decent performance, suggesting vast potential for using LLMs to enhance reasoning in vision tasks. The code can be accessed at https://github.com/xiuchao/LLM4ReC.
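To make the described setup concrete, here is a minimal, hypothetical sketch (not the released LLM4ReC code) of how textual object attributes could be handed to GPT-4 for zero-shot relational reasoning. The helper names, object attribute format, and prompt wording are illustrative assumptions; the client usage assumes the current `openai` Python package and an API key in the environment.

```python
# Sketch: resolve a referring expression by describing detected objects in text
# and asking an LLM to pick the target object. Assumes `pip install openai`.
from openai import OpenAI

client = OpenAI()

def describe_objects(objects):
    """Turn per-object attributes (category, color, center, size) into text lines."""
    lines = []
    for i, o in enumerate(objects):
        lines.append(
            f"Object {i}: category={o['category']}, color={o['color']}, "
            f"center=({o['cx']:.2f}, {o['cy']:.2f}), size={o['w']:.2f}x{o['h']:.2f}"
        )
    return "\n".join(lines)

def resolve_referring_expression(objects, expression, model="gpt-4"):
    """Ask the LLM which object index the referring expression denotes (zero-shot)."""
    prompt = (
        "The image contains the following objects, with normalized coordinates:\n"
        f"{describe_objects(objects)}\n\n"
        f'Which object does "{expression}" refer to? '
        "Reason about categories, colors, and spatial relations, "
        "then answer with the object index only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example usage with hypothetical detections (normalized image coordinates):
objects = [
    {"category": "cookie", "color": "brown", "cx": 0.42, "cy": 0.55, "w": 0.10, "h": 0.08},
    {"category": "cookie", "color": "brown", "cx": 0.80, "cy": 0.20, "w": 0.09, "h": 0.08},
    {"category": "cup",    "color": "white", "cx": 0.48, "cy": 0.50, "w": 0.12, "h": 0.15},
]
print(resolve_referring_expression(objects, "the brown cookie near the cup"))
```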