RACCOON: Grounding Embodied Question-Answering with State Summaries from Existing Robot Modules

Published: 2025 · Last Modified: 30 Jan 2026 · ICRA 2025 · CC BY-SA 4.0
Abstract: Explainability is vital for establishing user trust, including in robotics. Recently, foundation models (e.g., vision-language models, VLMs) have fostered a wave of embodied agents that answer arbitrary queries about their environment and their interactions with it. However, naively prompting VLMs to answer queries based on camera images does not take into account existing robot architectures, which represent the robot's tasks, skills, and beliefs about the state of the world. To overcome this limitation, we propose RACCOON, a framework that combines foundation models' responses with a robot's internal knowledge. Inspired by Retrieval-Augmented Generation (RAG), RACCOON selects relevant context, retrieves information from the robot's state, and uses it to refine prompts for an LLM to answer questions accurately. This bridges the gap between the model's adaptability and the robot's domain expertise.
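
The abstract describes a RAG-inspired pipeline: select relevant context, retrieve state summaries from existing robot modules, and refine the LLM prompt with them. Below is a minimal Python sketch of that idea under stated assumptions; the module names (`task_planner`, `object_belief`), the keyword-based selector, and the prompt template are hypothetical illustrations, not the paper's actual interfaces or retrieval mechanism.

```python
# Hypothetical sketch of a RACCOON-style pipeline: (1) select which robot
# modules are relevant to a question, (2) retrieve their state summaries,
# (3) fold those summaries into the prompt sent to an LLM.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModuleSummary:
    """A natural-language state summary exported by an existing robot module."""
    name: str                      # e.g. "task_planner", "object_belief"
    describe: Callable[[], str]    # returns the module's current state as text


def select_relevant_modules(question: str,
                            modules: List[ModuleSummary]) -> List[ModuleSummary]:
    """Naive keyword-overlap selection; a real system might use embeddings."""
    words = set(question.lower().split())
    hits = [m for m in modules
            if words & set(m.name.replace("_", " ").split())]
    return hits or modules  # fall back to all modules if nothing matches


def build_prompt(question: str, modules: List[ModuleSummary]) -> str:
    """Refine the LLM prompt with retrieved robot-state context (RAG-style)."""
    context = "\n".join(f"[{m.name}] {m.describe()}" for m in modules)
    return (
        "You are an embodied assistant. Answer using the robot's internal state.\n"
        f"Robot state:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    modules = [
        ModuleSummary("task_planner",
                      lambda: "Current task: deliver the cup to the kitchen table."),
        ModuleSummary("object_belief",
                      lambda: "Cup last observed on the counter."),
    ]
    question = "Why are you holding the cup?"
    prompt = build_prompt(question, select_relevant_modules(question, modules))
    print(prompt)  # this grounded prompt would then be sent to an LLM of choice
```

The point of the sketch is the structure, not the specifics: the robot's existing modules expose textual state summaries, and the framework retrieves and injects them so the LLM answers from the robot's beliefs rather than from camera images alone.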