Aligning Vision-Language Models With Human Directional Reference

18 Sept 2025 (modified: 04 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: vision-language models, spatial reasoning, multimodal learning
TL;DR: We investigate how object frontedness affects spatial reasoning in vision-language models and demonstrate that incorporating this intrinsic orientation significantly improves their ability to interpret spatial relationships as humans do.
Abstract: Spatial expressions are inherently ambiguous because communicators may adopt different perspectives, making interpretation highly dependent on the chosen frame of reference. Despite recent advances, current vision-language models (VLMs) still struggle to resolve this ambiguity in the absence of a clear reference frame, limiting effective communication between humans and machines. In contrast, humans often overcome this challenge by employing object-centered frames anchored to objects with an intrinsic ‘front’, a property known as frontedness, which determines their orientation and the spatial relationships around them. In this paper, we investigate the feasibility of endowing VLMs with object-centered spatial reasoning abilities, with frontedness as an essential component of the object-centric frame. To this end, we introduce a benchmark of synthetic 3D scenes for systematically evaluating the spatial reasoning of VLMs, and find that they consistently misidentify object orientations and tend to adopt a view-centric perspective. We show that enabling VLMs to perform spatial reasoning from an object-centric perspective yields better alignment with human behavior.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10865