VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
TL;DR: This paper identifies four key criteria for positional encoding: structure, frequency allocation, spatial symmetry, and temporal scaling. We propose VideoRoPE, which outperforms prior methods in video retrieval and understanding.
Abstract: While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, extending 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first presents a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce the challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors to V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code and model weights will be publicly released.
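The abstract names three design choices: a diagonal layout, low-frequency temporal allocation, and adjustable temporal spacing. The sketch below illustrates one plausible reading of these ideas in PyTorch; the function names, the equal three-way split of rotary channels, and the `delta` spacing parameter are assumptions for illustration rather than the authors' exact implementation (see the linked repository for that).

```python
import torch


def videorope_3d_positions(num_frames: int, height: int, width: int,
                           delta: float = 2.0) -> torch.Tensor:
    """Build (t, x, y) position indices for a grid of video tokens.

    Illustrative assumptions (not the paper's exact implementation):
    - `delta` plays the role of the adjustable temporal spacing, stretching
      the temporal index relative to the spatial grid;
    - spatial indices are offset so each frame's centre lies on the
      diagonal t == x == y, a simple reading of the "diagonal layout".
    """
    positions = []
    for f in range(num_frames):
        t = delta * f
        for i in range(height):
            for j in range(width):
                x = t + (i - (height - 1) / 2)
                y = t + (j - (width - 1) / 2)
                positions.append((t, x, y))
    return torch.tensor(positions)  # shape: (num_frames * height * width, 3)


def split_rotary_frequencies(head_dim: int = 128, base: float = 10000.0):
    """Assign rotary frequency bands to the three axes.

    Standard RoPE frequencies decrease with the channel index, so giving the
    *last* third to the temporal axis assigns it the lowest frequencies
    (low-frequency temporal allocation); the equal three-way split is an
    assumption for illustration.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    n = inv_freq.numel() // 3
    freq_x, freq_y, freq_t = inv_freq[:n], inv_freq[n:2 * n], inv_freq[2 * n:]
    return freq_t, freq_x, freq_y
```

In this reading, each video token's rotation angles would be formed by pairing its (t, x, y) index with the corresponding frequency band, while text tokens keep a shared 1D index on the diagonal.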
Lay Summary: Videos have complex structure that makes it hard for models to understand long sequences of information. Adapting methods designed for one-dimensional data (like text) to video is challenging because of video's spatio-temporal nature. Our research introduces a new method, VideoRoPE, that improves how models handle video by encoding time and space together more effectively. We find that existing methods fail when distractors (unrelated elements) are added to video tasks, so we design VideoRoPE to handle these distractions and reduce errors. The method outperforms older ones across a range of video tasks, such as retrieving video clips and understanding scenes, helping machines interpret video more reliably.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Wiselnn570/VideoRoPE
Primary Area: Applications->Computer Vision
Keywords: Rotary Position Embedding (RoPE), Spatio-temporal Encoding, VideoRoPE, V-NIAH-D Task, Temporal Dimension Allocation, 3D Position Embedding, Low-frequency Temporal Allocation, Diagonal Layout, Adjustable Temporal Spacing, Video Retrieval, Video Understanding, Video Hallucination, Position Encoding for Video, Distractor Handling in RoPE, Long-context Modeling
Submission Number: 3607