Language Models can do Zero-Shot Visual Referring Expression Comprehension

01 Mar 2023 (modified: 01 Jun 2023) · Submitted to Tiny Papers @ ICLR 2023
Keywords: referring expression comprehension, large language models
TL;DR: Language Models can do Zero-Shot Visual Referring Expression Comprehension
Abstract: The use of visual referring expressions is an important aspect of human-robot interactions. Comprehending referring expressions (ReC) like “the brown cookie near the cup” requires understanding both self-referential expressions, “brown cookie”, and relational referential expressions, “near the cup”. Large pretrained vision-language models like CLIP excel at handling self-referential expressions but struggle with the latter. In this work, we reframe ReC as a language reasoning task and explore whether it can be addressed using large pretrained language models (LLMs), including GPT-3.5 and GPT-4. Given textual attributes (object category, color, center location, size), GPT-3.5 is unstable at understanding spatial relationships even with heavy prompt engineering, while GPT-4 shows strong and stable zero-shot relational reasoning. Evaluation on the RefCOCO/g datasets and in interactive robot grasping scenarios shows that LLMs can perform ReC with decent accuracy. This suggests vast potential for using LLMs to enhance reasoning in vision tasks. The code can be accessed at https://github.com/xiuchao/LLM4ReC.
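
Below is a minimal sketch of the general idea described in the abstract: serialize detected objects into textual attributes (category, color, center location, size) and prompt an LLM to resolve a referring expression over them. It is not the released LLM4ReC pipeline; the object list, prompt wording, and helper function `resolve_referring_expression` are hypothetical, and it assumes the `openai` Python package (>=1.0) with an API key in the environment.

```python
# Sketch only: reframing referring expression comprehension as language reasoning.
# Assumes the `openai` package (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical detections, each reduced to textual attributes.
objects = [
    {"id": 0, "category": "cookie", "color": "brown", "center": (120, 340), "size": 45},
    {"id": 1, "category": "cookie", "color": "white", "center": (400, 200), "size": 50},
    {"id": 2, "category": "cup",    "color": "blue",  "center": (150, 360), "size": 90},
]

def resolve_referring_expression(expression: str) -> str:
    """Ask the LLM which object id the expression refers to, given textual attributes."""
    scene = "\n".join(
        f"id={o['id']}, category={o['category']}, color={o['color']}, "
        f"center={o['center']}, size={o['size']}"
        for o in objects
    )
    prompt = (
        "Objects in the image (image coordinates: x grows rightward, y grows downward):\n"
        f"{scene}\n\n"
        f"Which object id does this expression refer to: \"{expression}\"?\n"
        "Answer with the id only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(resolve_referring_expression("the brown cookie near the cup"))  # expected: 0
```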