Interactive Object Grounding Using Image-Grounded Scene Graphs and Prompt Chaining

ICLR 2026 Conference Submission 19609 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: object grounding, scene graph, prompt chaining, reference resolution, captioning, clarification, dialogue, disambiguation
Abstract: We introduce the task of Interactive Object Grounding, i.e., linking referring expressions in natural language instructions to objects in the physical environment and using clarification to handle ambiguities. Although recent foundation models can perform this task in a straightforward manner, we observe that they tend to generate lengthy and sometimes confusing clarification questions. Moreover, they require many input images to fully cover complex scenes, resulting in high processing costs. Alternative approaches use a scene graph instead of images to represent the environment, but these are limited by their reliance on predefined sets of object properties and spatial relations. Instead of end-to-end VLM prompting with many images, or LLM prompting over a text-only scene graph, we propose a prompt chaining method that utilises multimodal information sampled dynamically from an Image-Grounded Scene Graph (IGSG), leveraging existing LLMs/VLMs to perform object grounding and clarification question generation more effectively. Evaluations based on 3D scenes from ScanNet show that the proposed method outperforms an end-to-end baseline that does not use a scene graph, at only 35% of the cost. Furthermore, it achieves substantial improvements in grounding F-score through clarification, both with our simulated user (up to 34% gain) and with human subjects (up to 23.6% gain).
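
To illustrate the prompt-chaining idea described in the abstract, the following is a minimal sketch, not the authors' implementation: it assumes an IGSG whose nodes carry a text caption and an image crop, and hypothetical model clients `call_llm` and `call_vlm`. The chain shortlists candidate referents from text, verifies only the shortlisted crops with a VLM (keeping image-processing cost low), and falls back to generating a clarification question when the referent remains ambiguous.

```python
# Hypothetical sketch of prompt chaining over an Image-Grounded Scene Graph (IGSG).
# Node fields and the call_llm / call_vlm clients are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IGSGNode:
    node_id: str
    caption: str                       # text attributes/relations of the object
    image_path: str                    # image crop grounding this node
    neighbors: List[str] = field(default_factory=list)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def call_vlm(prompt: str, image_paths: List[str]) -> str:
    raise NotImplementedError("plug in a VLM client here")

def ground_referring_expression(instruction: str,
                                graph: List[IGSGNode]) -> Optional[str]:
    # Step 1: text-only shortlist -- ask the LLM which node captions
    # plausibly match the referring expression in the instruction.
    listing = "\n".join(f"{n.node_id}: {n.caption}" for n in graph)
    shortlist = call_llm(
        f"Instruction: {instruction}\nScene objects:\n{listing}\n"
        "Return the ids of plausible referents, comma-separated."
    )
    ids = {s.strip() for s in shortlist.split(",")}
    candidates = [n for n in graph if n.node_id in ids]

    # Step 2: multimodal verification -- only the shortlisted crops are
    # sent to the VLM, rather than images covering the whole scene.
    verdict = call_vlm(
        f"Instruction: {instruction}\nWhich of these crops is the referent? "
        "Answer with a single id, or 'ambiguous'.",
        [n.image_path for n in candidates],
    ).strip()

    # Step 3: clarification -- if still ambiguous, ask the LLM for a short
    # question that distinguishes the remaining candidates.
    if verdict == "ambiguous":
        question = call_llm(
            "Write one short clarification question distinguishing these objects:\n"
            + "\n".join(n.caption for n in candidates)
        )
        print(question)  # in a dialogue system, this would be shown to the user
        return None
    return verdict
```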
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19609