Tokens that Know Where: Self-improving 2D Spatial Vocabulary for Multi-modal Understanding

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multi-modal large language models, Multimodal Understanding, Vision-Language Models
Abstract: Due to the inherent loss of spatial information caused by token serialization in autoregressive frameworks, modern multimodal large language models (MLLMs) continue to face significant challenges in understanding and accurately referencing 2D spatial locations. In this work, we address a critical question: How can sequential tokens form a learnable and robust mapping to continuous 2D spatial positions? We introduce RefWords, a spatial representation that integrates a dedicated vocabulary of learnable tokens into MLLMs. RefWords features two key components: (1) Grid Tokens, which divide the image plane into structured spatial anchors, and (2) Offset Tokens, which enable detailed, iterative refinement of localization predictions. By embedding spatial relationships directly into the token representation space, RefWords allows MLLMs to perform native 2D reasoning without altering the autoregressive architecture. Extensive experiments demonstrate that RefWords achieves superior performance across various referring tasks in both supervised and reinforcement learning settings, showing that sequential tokens can effectively represent 2D space when given structured representations. This work presents a new paradigm for spatial reasoning in multi-modal systems.
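The abstract does not specify implementation details; as a rough, hypothetical sketch of the described mechanism, a grid token could select a coarse anchor cell while offset tokens iteratively refine the coordinate within that cell. The grid size, offset bin count, and the `decode_position` helper below are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical illustration of decoding grid + offset tokens into a continuous
# 2D position. All constants and the token layout are assumptions for illustration.

GRID_SIZE = 16        # assumed 16x16 grid of spatial anchors over the image plane
NUM_OFFSET_BINS = 8   # assumed 8 quantized offset bins per axis for refinement

def decode_position(grid_token_id: int, offset_token_ids: list[int]) -> tuple[float, float]:
    """Map one grid token plus a sequence of offset tokens to normalized (x, y) in [0, 1]."""
    # The grid token selects the center of one cell in the GRID_SIZE x GRID_SIZE partition.
    row, col = divmod(grid_token_id, GRID_SIZE)
    cell = 1.0 / GRID_SIZE
    x = (col + 0.5) * cell
    y = (row + 0.5) * cell

    # Each offset token refines the estimate within a progressively shrinking window,
    # mirroring the "iterative refinement" described in the abstract.
    window = cell
    for tok in offset_token_ids:
        bin_y, bin_x = divmod(tok, NUM_OFFSET_BINS)   # split token id into (y-bin, x-bin)
        x += ((bin_x + 0.5) / NUM_OFFSET_BINS - 0.5) * window
        y += ((bin_y + 0.5) / NUM_OFFSET_BINS - 0.5) * window
        window /= NUM_OFFSET_BINS

    return x, y

# Example: anchor cell (row=3, col=5) followed by one refinement step.
print(decode_position(3 * GRID_SIZE + 5, [2 * NUM_OFFSET_BINS + 6]))
```

Under this reading, the coarse-to-fine token sequence stays purely discrete and autoregressive while still resolving continuous positions, which is consistent with the abstract's claim of native 2D reasoning without architectural changes.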
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10376