Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning
Keywords: 3D Referring Reasoning; LLM; Finetuning
Abstract: If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging: it requires the ability both to parse the 3D structure of the scene and to correctly ground free-form language in the presence of distraction and clutter.
We propose Transcribe3D, a simple yet effective approach to interpreting 3D referring expressions that converts 3D scene geometry into a textual representation and leverages the commonsense reasoning capabilities of large language models (LLMs) to make inferences about the objects in a scene and their interactions.
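To make the transcription idea concrete, here is a minimal sketch of how detected 3D objects (class labels plus bounding-box geometry) might be serialized into a textual prompt for an LLM. All names (`SceneObject`, `transcribe_scene`, the prompt wording) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: serialize 3D scene objects into text an LLM can reason over.
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    obj_id: int
    label: str     # e.g. "chair", as produced by an off-the-shelf 3D detector
    center: tuple  # (x, y, z) bounding-box center, in meters
    size: tuple    # (w, l, h) bounding-box extents, in meters


def transcribe_scene(objects: List[SceneObject]) -> str:
    """Convert 3D scene geometry into a textual description, one object per line."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        w, l, h = o.size
        lines.append(
            f"object {o.obj_id}: {o.label}, center=({cx:.2f}, {cy:.2f}, {cz:.2f}), "
            f"size=({w:.2f}, {l:.2f}, {h:.2f})"
        )
    return "\n".join(lines)


def build_prompt(objects: List[SceneObject], referring_expression: str) -> str:
    """Assemble a zero-shot prompt asking the LLM to resolve the reference."""
    return (
        "The following objects are in a 3D scene:\n"
        f"{transcribe_scene(objects)}\n\n"
        f"Question: which object does '{referring_expression}' refer to? "
        "Answer with the object id."
    )


# Example usage: two candidate chairs and a distractor table.
scene = [
    SceneObject(0, "chair", (1.0, 0.5, 0.4), (0.5, 0.5, 0.9)),
    SceneObject(1, "chair", (3.0, 0.5, 0.4), (0.5, 0.5, 0.9)),
    SceneObject(2, "table", (2.0, 0.5, 0.4), (1.2, 0.8, 0.7)),
]
print(build_prompt(scene, "the chair closest to the table"))
```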
We experimentally demonstrate that employing LLMs in this zero-shot fashion outperforms contemporary methods. We then improve upon the zero-shot version of Transcribe3D by finetuning on self-corrected outputs to improve generalization. We report preliminary results on the ReferIt3D dataset, achieving state-of-the-art performance. We also show that our method enables real robots to perform pick-and-place tasks given queries containing challenging referring expressions.
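One plausible reading of "finetuning from self-correction" is sketched below: when the model resolves a referring query incorrectly, it is re-prompted with corrective feedback, and the repaired exchange is kept as supervised finetuning data. The `llm` callable, the feedback wording, and the data format are assumptions for illustration, not the authors' procedure.

```python
# Hypothetical sketch of a self-correction data-collection loop for finetuning.
from typing import Callable, List, Tuple


def collect_self_corrected_data(
    llm: Callable[[str], str],
    examples: List[Tuple[str, str]],  # (prompt, ground-truth object id)
) -> List[dict]:
    finetune_set = []
    for prompt, gold in examples:
        answer = llm(prompt)
        if answer.strip() == gold:
            # Correct on the first try: keep as a positive example.
            finetune_set.append({"prompt": prompt, "completion": answer})
        else:
            # Ask the model to reconsider, given feedback on its mistake.
            feedback = (
                f"{prompt}\n\nYour previous answer '{answer}' was incorrect. "
                "Re-examine the object descriptions and answer again."
            )
            corrected = llm(feedback)
            if corrected.strip() == gold:
                finetune_set.append({"prompt": feedback, "completion": corrected})
    # The collected pairs would then serve as a supervised finetuning set.
    return finetune_set
```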
Submission Number: 38