Multimodal Language Models Cannot Spot Spatial Inconsistencies

Published: 02 Mar 2026, Last Modified: 02 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: tiny paper (up to 4 pages)
Keywords: multimodal, spatial intelligence, evaluation, 3d understanding
Abstract: Spatial consistency is a fundamental property of the visual world and a critical requirement for models that aim to understand or generate realistic scenes. Yet, despite their impressive capabilities, today’s multimodal large language models (MLLMs) often fail to reason about 3D geometry across views. We introduce a new task that directly tests this ability: given two views of the same scene, identify which object violates 3D consistency. To create data for this task at scale, we propose a simple, fully automatic method that constructs realistic pairs of inconsistent images from multi-view captures. The method uses object segmentation, inpainting, and cross-view replacement to introduce controlled geometric inconsistencies without manual annotation. Using this approach, we build a dataset and evaluate several state-of-the-art MLLMs, including GPT-5, Gemini 2.5 Pro, and Qwen3 VL 8B. Humans outperform all models by a large margin, revealing that current systems lack robust spatial reasoning. Moreover, fine-tuning an MLLM such as Qwen3 VL 4B on our task not only improves its accuracy and generalization but also enhances performance on other benchmarks such as BLINK. Our findings underscore spatial consistency as a key frontier in multimodal reasoning and present a practical framework for advancing geometric understanding in next-generation MLLMs. Our code and benchmark will be made publicly available.
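
The abstract outlines a three-step construction (segment, inpaint, cross-view replace). As a rough illustration only, the sketch below shows how such a pipeline might be wired together; `segment_object` and `inpaint` are hypothetical stand-ins for off-the-shelf models (e.g., a SAM-style segmenter and a diffusion inpainter), and the naive cross-view paste is one plausible way to induce a geometric violation, not necessarily the authors' exact procedure.

```python
# Rough sketch (not the authors' released code) of the automatic
# inconsistency-construction pipeline described in the abstract:
# segment an object, inpaint it away, then paste a cross-view copy
# without correcting for the camera change.
import numpy as np

def make_inconsistent_pair(view_a, view_b, segment_object, inpaint):
    """Return (view_a, corrupted_view_b) where one object in view_b
    has been replaced so the pair is no longer 3D-consistent.

    segment_object and inpaint are hypothetical stand-ins for any
    off-the-shelf segmenter / inpainter.
    """
    # 1. Pick an object in view B and obtain its pixel mask.
    mask_b, obj_id = segment_object(view_b)

    # 2. Remove it and hallucinate a plausible background.
    background_b = inpaint(view_b, mask_b)

    # 3. Crop the same object from view A and paste it at the old
    #    location in view B. Because the crop ignores the camera
    #    change, its pose/scale now violates cross-view geometry.
    mask_a, _ = segment_object(view_a, target=obj_id)
    ya, xa = np.nonzero(mask_a)
    patch = view_a[ya.min():ya.max() + 1, xa.min():xa.max() + 1]

    corrupted = background_b.copy()
    yb, xb = np.nonzero(mask_b)
    y0, x0 = yb.min(), xb.min()
    y1 = min(y0 + patch.shape[0], corrupted.shape[0])
    x1 = min(x0 + patch.shape[1], corrupted.shape[1])
    corrupted[y0:y1, x0:x1] = patch[:y1 - y0, :x1 - x0]  # naive paste

    return view_a, corrupted
```

Pasting the view-A crop verbatim is the simplest way to obtain a controlled violation; a practical pipeline would presumably also blend patch edges and filter out objects whose appearance barely changes between views.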
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 69