Keywords: Multimodal Large Language Models, RL, Spatial Reasoning
TL;DR: We introduce SpatialThinker, a 3D-aware MLLM trained with dense spatial rewards on 7K synthetic VQA samples, achieving nearly double the gains of vanilla RL and matching or surpassing GPT-4o on spatial reasoning tasks.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision–language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by their reliance on large-scale datasets and sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model mimics human-like spatial perception by constructing a scene graph of task-relevant objects and their spatial relations, then reasoning over it under dense spatial reward supervision. SpatialThinker builds on two key innovations: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward that enforces spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse-reward RL baseline across six spatial understanding benchmarks, nearly doubling the base model's gain compared to sparse RL (+6.5% vs. +3.6%), and matches or surpasses GPT-4o. These results demonstrate that combining spatial supervision with reward-aligned reasoning enables robust 3D spatial understanding from limited data and advances MLLMs toward human-level visual reasoning.
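The abstract does not spell out how the multi-objective dense spatial reward is composed. As a rough illustration only, the sketch below assumes a weighted combination of a format term, an answer-correctness term, and an IoU-based grounding term over predicted object boxes; every name (`Rollout`, `dense_spatial_reward`, the weights) is hypothetical and not the authors' implementation.

```python
# Hypothetical sketch of a multi-objective dense spatial reward.
# Assumes the reward mixes format, answer-correctness, and spatial-grounding terms;
# names and weights are illustrative, not taken from the paper.
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

@dataclass
class Rollout:
    follows_format: bool               # e.g. emits scene-graph and answer tags
    answer: str                        # final answer string
    predicted_boxes: dict[str, Box]    # object name -> predicted box

def dense_spatial_reward(rollout: Rollout, gt_answer: str,
                         gt_boxes: dict[str, Box],
                         w_format: float = 0.2, w_answer: float = 0.5,
                         w_spatial: float = 0.3) -> float:
    """Weighted sum of format, answer, and grounding rewards in [0, 1]."""
    r_format = 1.0 if rollout.follows_format else 0.0
    r_answer = 1.0 if rollout.answer.strip().lower() == gt_answer.strip().lower() else 0.0
    # Dense grounding term: mean IoU over the ground-truth objects the rollout localized.
    ious = [iou(rollout.predicted_boxes.get(name, (0.0, 0.0, 0.0, 0.0)), box)
            for name, box in gt_boxes.items()]
    r_spatial = sum(ious) / len(ious) if ious else 0.0
    return w_format * r_format + w_answer * r_answer + w_spatial * r_spatial

# Example: a rollout that localizes the mug reasonably and answers correctly.
r = Rollout(follows_format=True, answer="left",
            predicted_boxes={"mug": (10, 20, 50, 60)})
print(dense_spatial_reward(r, "left", {"mug": (12, 22, 52, 62)}))
```

A scalar of this form could serve as the per-rollout reward in an online RL loop (e.g. a GRPO/PPO-style objective); the grounding term is what makes the signal dense relative to an answer-only reward.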
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 150