[Non-archival] Evaluation of Multilingual Ability to Use Spatial Deictic Expressions in Vision-Language Models
Keywords: Vision-Language Models, Multilingual, VLM Evaluation, Spatial reasoning
TL;DR: We develop a benchmark for evaluating the ability of Vision-Language Models to use spatial deictic expressions across languages.
Abstract: One of the expected abilities of vision-language models (VLMs) is spatial reasoning over a given text and image. To evaluate the spatial reasoning abilities of VLMs, we focus on the use of spatial deictic expressions, defined as spatial expressions whose referent is determined by their situational context, such as "this" and "that". To handle spatial deictic expressions, VLMs must jointly reason over language and visual space, grounding context-dependent references in the image's spatial structure. In addition, selecting appropriate spatial deictic expressions across languages requires VLMs to understand the language-specific spatial distinctions these expressions encode. In this paper, we develop a benchmark to evaluate the multilingual ability of VLMs to use spatial deictic expressions in four languages. Our experiments using this benchmark reveal that the tested models use demonstratives differently from humans, particularly when selecting the appropriate demonstrative based on distance from the object.
Submission Number: 40