Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Yifan Shen; Yuanzhe Liu; Jingyuan Zhu; Xu Cao; Xiaofeng Zhang; Yixiao He; Wenming Ye; James Matthew Rehg; Ismini Lourentzou

Published: 18 Sept 2025, Last Modified: 04 Jan 2026

Venue: NeurIPS 2025 poster

License: CC BY-ND 4.0

Keywords: Spatial Reasoning, Vision-Language Models, Fine-grained DPO, Long Chain-of-Thought

Abstract: Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves relative performance gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.
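To make the idea of segment-specific preference granularity concrete, here is a minimal sketch of a DPO-style loss in which each response segment gets its own scaling coefficient. It is an illustration under stated assumptions, not the fDPO objective from the paper: the abstract does not give the formula, so the function name `segment_weighted_dpo_loss`, the segment labels `description` and `reasoning`, and the per-segment `beta` values are all hypothetical. Each margin is assumed to be the policy-versus-reference log-ratio of the preferred response minus that of the rejected response, restricted to one segment.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def segment_weighted_dpo_loss(margins: dict, betas: dict) -> float:
    """DPO-style loss with one beta per segment (hypothetical sketch).

    margins[s]: log-ratio margin (preferred minus rejected) for segment s.
    betas[s]:   assumed scaling coefficient for segment s.
    """
    # Weight each segment's margin by its own beta, then apply the usual
    # negative log-sigmoid used by standard DPO.
    z = sum(betas[s] * m for s, m in margins.items())
    return -log_sigmoid(z)


# Hypothetical numbers: the reasoning segment is weighted more heavily.
example_loss = segment_weighted_dpo_loss(
    margins={"description": 1.0, "reasoning": 2.0},
    betas={"description": 0.1, "reasoning": 0.3},
)
```

When every segment shares the same beta, this sketch reduces to the standard DPO loss on the summed margin, which is the baseline the abstract compares against.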

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 3531
