ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models

ACL ARR 2026 January Submission10343 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: RoPE, MultiModal Large Language Model
Abstract: A prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both a high-resolution version and a thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), RoPE's long-term decay hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text tokens and image tokens. To address these issues, we propose \textbf{ID-Align}, which alleviates these problems by reordering position IDs: high-resolution tokens inherit the IDs of their corresponding thumbnail tokens, which also constrains the overexpansion of positional indices. Our extensive experiments within the LLaVA-Next framework demonstrate that ID-Align delivers comprehensive improvements: it not only achieves notable quantitative gains across multiple benchmarks, highlighted by a significant improvement on MMBench's relation-reasoning tasks, but also qualitatively yields a more interpretable attention distribution.
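The remapping described in the abstract — each high-resolution token inheriting the position ID of its spatially corresponding thumbnail token, so positional indices do not grow with the number of high-resolution patches — can be sketched as follows. This is a hypothetical illustration, not the authors' implementation; the function name and the nearest-cell mapping rule are assumptions for the sake of the example.

```python
def remap_position_ids(thumb_ids, thumb_h, thumb_w, hires_h, hires_w):
    """Hypothetical sketch of ID-Align-style remapping.

    Each token in a hires_h x hires_w grid of high-resolution patches
    receives the position ID of the thumbnail cell (thumb_h x thumb_w
    grid) that covers its spatial location, so the range of position
    IDs stays bounded by the thumbnail's IDs.
    """
    hires_ids = []
    for r in range(hires_h):
        tr = r * thumb_h // hires_h  # covering thumbnail row
        for c in range(hires_w):
            tc = c * thumb_w // hires_w  # covering thumbnail column
            hires_ids.append(thumb_ids[tr * thumb_w + tc])
    return hires_ids

# Example: a 2x2 thumbnail with position IDs 0..3 and a 4x4
# high-resolution grid; every hires token reuses a thumbnail ID,
# so the maximum position index never exceeds 3.
ids = remap_position_ids([0, 1, 2, 3], 2, 2, 4, 4)
```

Under this mapping the 16 high-resolution tokens share only the 4 thumbnail IDs, which is the sense in which the abstract's "overexpansion of positional indices" is constrained.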
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: RoPE, MultiModal Large Language Model
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 10343