ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models

ACL ARR 2026 January Submission10343 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: RoPE, MultiModal Large Language Model
Abstract: A prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both a high-resolution version and a thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), RoPE's long-term decay hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text tokens and image tokens. To address these issues, we propose \textbf{ID-Align}, which alleviates these problems by reordering position IDs: high-resolution tokens inherit the IDs of their corresponding thumbnail tokens, which also constrains the overexpansion of positional indices. Our extensive experiments within the LLaVA-Next framework demonstrate that ID-Align delivers comprehensive improvements: it not only achieves notable quantitative gains across multiple benchmarks, highlighted by a significant improvement on MMBench's relation-reasoning tasks, but also qualitatively yields a more interpretable attention distribution.
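The remapping described in the abstract — each high-resolution token inheriting the position ID of its spatially corresponding thumbnail token, so positional indices do not grow with the number of high-resolution patches — can be sketched as follows. This is a hypothetical illustration, not the authors' implementation; the function name and the nearest-cell mapping rule are assumptions for the sake of the example.

```python
def remap_position_ids(thumb_ids, thumb_h, thumb_w, hires_h, hires_w):
    """Hypothetical sketch of ID-Align-style remapping.

    Each token in a hires_h x hires_w grid of high-resolution patches
    receives the position ID of the thumbnail cell (thumb_h x thumb_w
    grid) that covers its spatial location, so the range of position
    IDs stays bounded by the thumbnail's IDs.
    """
    hires_ids = []
    for r in range(hires_h):
        tr = r * thumb_h // hires_h  # covering thumbnail row
        for c in range(hires_w):
            tc = c * thumb_w // hires_w  # covering thumbnail column
            hires_ids.append(thumb_ids[tr * thumb_w + tc])
    return hires_ids

# Example: a 2x2 thumbnail with position IDs 0..3 and a 4x4
# high-resolution grid; every hires token reuses a thumbnail ID,
# so the maximum position index never exceeds 3.
ids = remap_position_ids([0, 1, 2, 3], 2, 2, 4, 4)
```

Under this mapping the 16 high-resolution tokens share only the 4 thumbnail IDs, which is the sense in which the abstract's "overexpansion of positional indices" is constrained.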
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: RoPE, MultiModal Large Language Model
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 10343