Keywords: Embodied Question Answering, Scene Graphs, Vision-Language Models, Modifiable Representations, Spatial Reasoning, 3D Scene Understanding, Inference-time Updates
TL;DR: GraphPad enables vision-language models to dynamically update their 3D scene graphs during inference, improving embodied question answering performance while requiring 5× fewer input frames.
Abstract: Structured scene representations are a core component of embodied agents, helping to consolidate raw sensory streams into interpretable, modular, and searchable formats. Because building such representations is computationally expensive, many approaches construct them in advance of the task. However, when task specifications change, these static representations become inadequate: they may miss key objects, spatial relations, and details. We introduce \textbf{GraphPad}, a modifiable structured memory that an agent can tailor to the needs of the task through API calls. It comprises a mutable scene graph representing the environment, a navigation log indexing frame-by-frame content, and a scratchpad for task-specific notes. Together, these components make GraphPad a dynamic workspace that remains complete, current, and aligned with the agent's immediate understanding of the scene and its task. On the OpenEQA benchmark, GraphPad attains \textbf{55.3\%} accuracy, a gain of \textbf{3.0 percentage points} over an image-only baseline that uses the same vision-language model, while operating with \textbf{five times fewer} input frames. These results show that allowing online, language-driven refinement of 3D memory yields more informative representations without extra training or data collection.
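To make the three components concrete, the sketch below shows a minimal, hypothetical stand-in for the structured memory the abstract describes: a mutable scene graph, a frame-indexed navigation log, and a free-form scratchpad, all exposed through simple API calls. The class and method names (`GraphPadSketch`, `add_object`, `add_relation`, `log_frame`, `note`) are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GraphPadSketch:
    """Hypothetical, minimal stand-in for the three GraphPad components
    named in the abstract; the paper's actual API may differ."""
    nodes: Dict[str, dict] = field(default_factory=dict)             # scene-graph objects and attributes
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object) triples
    navigation_log: Dict[int, str] = field(default_factory=dict)     # frame index -> short summary
    scratchpad: List[str] = field(default_factory=list)              # free-form task-specific notes

    # --- mutable scene graph -------------------------------------------
    def add_object(self, name: str, **attributes) -> None:
        self.nodes[name] = attributes

    def add_relation(self, subject: str, relation: str, obj: str) -> None:
        self.edges.append((subject, relation, obj))

    # --- navigation log --------------------------------------------------
    def log_frame(self, frame_idx: int, summary: str) -> None:
        self.navigation_log[frame_idx] = summary

    # --- scratchpad -------------------------------------------------------
    def note(self, text: str) -> None:
        self.scratchpad.append(text)


# Illustrative usage: the agent issues calls like these at inference time
# to refine its memory before answering a question.
pad = GraphPadSketch()
pad.add_object("mug", color="red", room="kitchen")
pad.add_relation("mug", "on", "counter")
pad.log_frame(12, "kitchen counter visible; red mug on the left")
pad.note("Question asks about the mug's color; confirmed red in frame 12.")
```

Because every update is an explicit call, the representation can stay aligned with the task at hand rather than being frozen at construction time, which is the property the abstract emphasizes.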
Submission Number: 8