Keywords: video understanding, long-video understanding
Abstract: Understanding long-form videos remains a significant challenge for vision-language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments and leads to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues that occur near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose \textbf{AdaRD-Key}, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance–Diversity Max-Volume (RD-MV) objective, which combines a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries that align only weakly with the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism: when the relevance distribution indicates weak alignment, the method shifts into a diversity-only mode, enhancing coverage without requiring additional supervision. The entire pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate that AdaRD-Key achieves state-of-the-art performance, particularly on long-form videos.
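For intuition, the sketch below shows one plausible greedy instantiation of a relevance-plus-log-determinant objective with a diversity-only fallback, as the abstract describes. It is a minimal illustration, not the paper's implementation: the function name `rd_mv_select`, the cosine-similarity relevance, the gating rule, and all hyperparameters (`lam`, `gate_threshold`) are assumptions for exposition only.

```python
import numpy as np

def rd_mv_select(frame_feats, query_feat, k, lam=1.0, gate_threshold=0.2):
    """Greedy keyframe selection sketch: query-conditioned relevance plus
    a log-determinant (volume) diversity term over selected frames.
    Hypothetical formulation; not the paper's exact objective."""
    # Normalize so inner products are cosine similarities.
    F = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    rel = F @ q  # per-frame query relevance

    # Assumed gating rule: if relevance is uniformly weak (no frame stands
    # out from the mean), drop the relevance term and select for diversity only.
    if rel.max() - rel.mean() < gate_threshold:
        rel = np.zeros_like(rel)

    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(F)):
            if i in selected:
                continue
            S = F[np.array(selected + [i])]
            # log-det of the regularized Gram matrix grows with the volume
            # spanned by the chosen features, rewarding non-redundant frames.
            gram = S @ S.T + 1e-6 * np.eye(len(S))
            score = rel[i] + lam * np.linalg.slogdet(gram)[1]
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)
```

Under this reading, the log-determinant term is what enforces non-redundancy: adding a frame nearly collinear with an already selected one barely increases the Gram matrix's volume, so nearby near-duplicate timestamps are penalized without any hard exclusion window.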
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2069