Keywords: Multi-modal Reasoning; Situated Question-Answering; 3D Scene Understanding
Abstract: Situation awareness is essential for embodied AI agents to understand and reason about 3D
scenes. However, existing datasets and benchmarks for situated
understanding suffer from severe limitations in data modality, scope, diversity, and
scale. To address these limitations, we propose Multi-modal Situated Question
Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably
collected by leveraging 3D scene graphs and vision-language models (VLMs) across
a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering
pairs across 9 distinct question categories, covering complex scenarios
and object modalities within 3D scenes. We introduce a novel interleaved multi
modal input setting in our benchmark to provide both texts, images, and point
clouds for situation and question description, aiming to resolve ambiguity in
describing situations with single-modality inputs (e.g., texts). Additionally, we
devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate
models’ grounding of actions and transitions between situations. Comprehensive
evaluations on reasoning and navigation tasks highlight the limitations of existing
vision-language models and underscore the importance of handling multi-modal
interleaved inputs and situation modeling. Experiments on data scaling and cross-domain
transfer further demonstrate the effectiveness of leveraging MSQA as
a pre-training dataset for developing more powerful situated reasoning models,
contributing to advancements in 3D scene understanding for embodied AI.
Submission Number: 695