TL;DR: Hypo3D: a benchmark evaluating whether foundation models can reason about 3D scenes by imagining described changes, without real-time scene access.
Abstract: The rise of vision-language foundation models marks a significant step toward bridging the gap between human and machine capabilities in 3D scene reasoning. However, existing 3D reasoning benchmarks assume real-time access to the scene, which is impractical given the high cost of frequently updating scene data. To this end, we introduce *Hypothetical 3D Reasoning* (Hypo3D), a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models must imagine the scene state from a provided change description before reasoning about it. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring that directional terms in context changes and QAs are consistently grounded in a global frame. Extensive experiments show that state-of-the-art foundation models struggle to reason effectively in hypothetically changed scenes, revealing a substantial performance gap relative to humans, particularly in scenarios involving movement changes and directional reasoning. Even when a change is irrelevant to the question, models often incorrectly adjust their answers. The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D
Lay Summary: Imagine a rescue robot navigating a collapsed building. It receives an update: a wall has fallen, opening a new path. A human can imagine this change and plan accordingly without seeing the updated scene. Today’s AI cannot. Most models rely on real-time, detailed 3D data to reason, which is often unavailable in dynamic, unpredictable environments. To address this, we introduce Hypo3D, the first benchmark that evaluates whether AI can reason about a 3D scene using only a description of how it has changed. This mirrors human-like hypothetical reasoning: mentally updating a scene before making decisions. Our dataset spans 7,727 changes and 14,885 question-answer pairs across 700 indoor scenes, challenging models to imagine and infer. Our results reveal a striking gap: even state-of-the-art AI models consistently fail when imagination is required. Hypo3D highlights this fundamental limitation and offers a path toward AI systems that reason more flexibly, safely, and in a more human-like way in complex real-world settings.
Link To Code: https://github.com/MatchLab-Imperial/Hypo3D
Primary Area: Deep Learning->Large Language Models
Keywords: 3D Computer Vision, Vision-Language Model, Hypothetical Reasoning
Submission Number: 1052
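To make the task format concrete, below is a minimal sketch of a Hypo3D-style evaluation loop: the model is given a change description, asked to imagine the updated scene, and then answers a question about it. The field names (`scene_id`, `change_description`, `question`, `answer`) and the `query_model` callable are hypothetical illustrations, not the released schema or API; see the repository linked above for the actual loading and evaluation code.

```python
def build_prompt(example: dict) -> str:
    """Compose a prompt that asks the model to imagine the changed scene
    before answering. Field names are assumed, not the released schema."""
    return (
        f"A 3D indoor scene (id: {example['scene_id']}) has changed as follows:\n"
        f"{example['change_description']}\n\n"
        "Imagine the scene after this change, then answer:\n"
        f"{example['question']}"
    )

def evaluate(examples: list, query_model) -> float:
    """Exact-match accuracy over QA pairs; `query_model` is a stand-in
    for any vision-language model inference call."""
    correct = 0
    for ex in examples:
        prediction = query_model(build_prompt(ex))
        correct += prediction.strip().lower() == ex["answer"].strip().lower()
    return correct / len(examples)

if __name__ == "__main__":
    # Toy example in the assumed schema; the real benchmark contains
    # 14,885 QA pairs over 7,727 context changes in 700 indoor scenes.
    examples = [{
        "scene_id": "scene0000_00",
        "change_description": "The chair near the window is moved to the left of the desk.",
        "question": "Which object is now to the left of the desk?",
        "answer": "chair",
    }]
    print(evaluate(examples, query_model=lambda prompt: "chair"))
```

The two-part prompt mirrors the benchmark's premise: the model must first mentally update the scene from the change description, since no post-change visual input is available.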