ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Abstract: The rapid advancement of Vision-Language Models (VLMs) has driven researchers to increase image token counts through dynamic high-resolution strategies, typically involving image upscaling, grid-based cropping, and joint encoding of multi-resolution patches. Although this approach enriches visual detail, it inadvertently introduces challenges stemming from the long-range decay characteristics of Rotary Position Embedding (RoPE): excessive positional gaps between high- and low-resolution tokens disrupt their spatial correspondence, limiting the model's fine-grained perception capabilities. To address this issue, we introduce \textbf{ID-Align}, a positional encoding strategy that preserves hierarchical relationships by aligning image token position IDs across resolutions. In this method, each high-resolution token inherits the position ID of its corresponding low-resolution counterpart, which simultaneously constrains the overexpansion of positional indices. Our experiments within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a $6.07\%$ gain on MMBench’s cross-instance fine-grained perception tasks and notable gains across multiple benchmarks.
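The ID inheritance described in the abstract can be illustrated with a minimal sketch. Assuming the base (low-resolution) view is a small grid of position IDs and the high-resolution token grid tiles it at an integer multiple per axis, the snippet below assigns each high-resolution token the ID of the low-resolution token covering the same spatial region. The function name `remap_position_ids` and the grid shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def remap_position_ids(base_ids: np.ndarray, hr_grid: tuple) -> np.ndarray:
    """Give each high-resolution token the position ID of the
    low-resolution token covering the same spatial region.

    base_ids : (h, w) array of position IDs for the low-res (base) view.
    hr_grid  : (H, W) shape of the high-resolution token grid; H and W
               are assumed to be integer multiples of h and w.
    """
    h, w = base_ids.shape
    H, W = hr_grid
    assert H % h == 0 and W % w == 0, "high-res grid must tile the base grid"
    # Each base token covers an (H//h) x (W//w) block of high-res tokens,
    # so repeating the base IDs along both axes performs the remapping.
    return np.repeat(np.repeat(base_ids, H // h, axis=0), W // w, axis=1)

# Example: a 2x2 base grid expanded to a 4x4 high-resolution grid.
base = np.arange(4).reshape(2, 2)  # IDs 0..3 from the low-res view
print(remap_position_ids(base, (4, 4)))
# [[0 0 1 1]
#  [0 0 1 1]
#  [2 2 3 3]
#  [2 2 3 3]]
```

Because the high-resolution tokens reuse existing IDs rather than extending the index range, the relative distances that drive RoPE's decay stay bounded, which is the effect the abstract attributes to the method.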
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal application; cross-modal pretraining
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8347