Keywords: 3D Question Answering, Embodied Question Answering, 3D Scene Understanding
Abstract: Answering questions accurately and efficiently in embodied scenarios presents significant challenges due to limited computational and GPU memory resources. Current embodied systems struggle with the GPU memory overhead of Vision-Language Model (VLM) inference over the extensive video frames collected during scene exploration. An intuitive solution is to select question-related key frames for VLM inference. Existing approaches adopt a visual search-based key frame selection paradigm, which is inefficient because the vision model must run inference over every frame for each individual query. In this work, we propose a novel memory-tree-guided key frame selection paradigm for 3D question answering in embodied scenarios. Our method leverages a compact and reusable 3D scene representation, termed MemTree3D, which supports real-time online construction using the camera's 6-DoF pose. MemTree3D captures multi-level 3D scene information, enabling a Large Language Model to efficiently query and retrieve question-relevant key frames through our scoring-based frame selection without reprocessing the entire video stream. On OpenEQA, our method improves the accuracy of GPT-4o by 17.4%, achieving state-of-the-art performance and outperforming existing visual search methods in both accuracy and efficiency, demonstrating our work's potential as an effective solution for real-world embodied applications requiring fast and accurate scene understanding. Our code will be released with the final version of the paper.
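The abstract's core idea, querying a reusable tree of scene summaries instead of re-running a vision model over every frame, can be illustrated with a minimal sketch. Everything here is hypothetical: the `MemNode` structure, the word-overlap `score` function (a stand-in for the LLM-based relevance scoring the paper describes), and `select_key_frames` are illustrative names, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MemNode:
    """One node in a hypothetical memory tree: a text summary of a scene
    region, child nodes for sub-regions, and frame indices at the leaves."""
    summary: str
    children: list = field(default_factory=list)
    frame_ids: list = field(default_factory=list)

def score(question: str, summary: str) -> float:
    # Stand-in relevance score: word overlap between question and summary.
    # A real system would instead ask an LLM to rate relevance.
    q = set(question.lower().split())
    s = set(summary.lower().split())
    return len(q & s) / (len(q) or 1)

def select_key_frames(root: MemNode, question: str, top_k: int = 2) -> list:
    """Walk the tree, score each leaf summary against the question,
    and collect frame ids from the best-scoring leaves."""
    scored, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.frame_ids:
            scored.append((score(question, node.summary), node.frame_ids))
        stack.extend(node.children)
    scored.sort(key=lambda t: t[0], reverse=True)
    frames = []
    for _, ids in scored:
        for f in ids:
            if f not in frames:
                frames.append(f)
        if len(frames) >= top_k:
            break
    return frames[:top_k]
```

Because the tree is built once during exploration and only its text summaries are scored per question, each new query avoids reprocessing the video stream, which is the efficiency argument the abstract makes against per-query visual search.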
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6333