Synthetic Counterfactual World Models for Multimodal Spatial Reasoning in Low-Resource 3D Domains

Published: 09 May 2026, Last Modified: 09 May 2026, MUSI, Everyone
Keywords: Spatial Reasoning, Counterfactual Reasoning, Multimodal LLMs, 3D Scene Understanding, World Models, Synthetic Data Generation, Physical Plausibility, Evaluation Framework
TL;DR: SCWM generates controlled 3D counterfactual scenes; MLLM accuracy drops 22.9 percentage points on spatial counterfactual questions (71.2% to 48.3%) and reaches only 43.8% on physics violations, revealing reliance on appearance over structured 3D reasoning.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated promising capabilities in visual reasoning, yet their understanding of three-dimensional spatial relationships and physical constraints remains limited. Existing benchmarks predominantly evaluate static two-dimensional reasoning and rarely assess structured spatial generalization under counterfactual perturbations. In this work, we introduce Synthetic Counterfactual World Modeling (SCWM), a framework that systematically generates controlled three-dimensional scene variations to probe spatial reasoning and causal understanding in MLLMs. SCWM leverages procedural scene synthesis and physics-aware perturbations to create counterfactual variants of base environments. We evaluate four state-of-the-art MLLMs against two baseline approaches across 2,500 synthetic scenes with 7,500 counterfactual variants. Our findings reveal that current models achieve only 48.3% accuracy on spatial counterfactual questions compared to 71.2% on standard spatial tasks, with performance deteriorating to 43.8% on physically implausible scenarios. We propose evaluation metrics for spatial counterfactual robustness and identify specific failure modes in contemporary architectures.
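The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above. The variable names are illustrative. The relative drop is an assumed normalization shown for comparison, not one of the metrics proposed in the paper. The variants-per-scene value is an average and does not imply every scene has the same number of variants.

```python
# Figures reported in the abstract (accuracy in percent).
standard_acc = 71.2        # standard spatial tasks
counterfactual_acc = 48.3  # spatial counterfactual questions
implausible_acc = 43.8     # physically implausible scenarios

# Absolute drop in percentage points (the 22.9 quoted in the TL;DR).
absolute_drop = round(standard_acc - counterfactual_acc, 1)

# Relative drop as a share of standard accuracy.
# Assumption: illustrative normalization, not a metric defined by the authors.
relative_drop = round(100 * (standard_acc - counterfactual_acc) / standard_acc, 1)

# Further gap between counterfactual and physically implausible scenarios.
implausible_gap = round(counterfactual_acc - implausible_acc, 1)

# Average number of counterfactual variants per base scene.
base_scenes = 2500
counterfactual_variants = 7500
variants_per_scene = counterfactual_variants / base_scenes

print(absolute_drop, relative_drop, implausible_gap, variants_per_scene)
```

Read this way, the 22.9 figure is a difference in percentage points, which corresponds to roughly a third of the standard-task accuracy.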
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 1