Aligning Vision-Language Models With Human Directional Reference

18 Sept 2025 (modified: 04 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: vision-language models, spatial reasoning, multimodal learning
TL;DR: We investigate how object frontedness affects spatial reasoning in vision-language models and demonstrate that incorporating this intrinsic orientation significantly improves their ability to interpret spatial relationships as humans do.
Abstract: Spatial expressions are inherently ambiguous because communicators may adopt different perspectives, making interpretation highly dependent on the chosen frame of reference. Despite recent advances, current vision-language models (VLMs) still struggle to resolve this ambiguity in the absence of a clear reference frame, limiting effective communication between humans and machines. In contrast, humans often overcome this challenge by employing object-centered frames anchored to objects with an intrinsic ‘front’, a property known as frontedness, which determines their orientation and the spatial relationships around them. In this paper, we investigate the feasibility of endowing VLMs with object-centered spatial reasoning abilities, with frontedness as an essential component of the object-centric frame. To this end, we introduce a benchmark of synthetic 3D scenes for systematically evaluating the spatial reasoning of VLMs, and find that they consistently misidentify object orientations and tend to adopt a view-centric perspective. We show that enabling VLMs to perform spatial reasoning from an object-centric perspective yields better alignment with human behavior.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10865