Event-Anchored Frame Selection for Efficient Long-Video Understanding

16 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Understanding, Video-based LLM, Frame Selection
TL;DR: Event-aware, training-free keyframe selection drastically improves LVLMs’ understanding of long videos
Abstract: Massive frame redundancy and limited context windows make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm that treats the video as an unstructured collection of frames. In this paper, we introduce $\textbf{E}$vent-Anchored $\textbf{F}$rame $\textbf{S}$election $\textbf{(EFS)}$, a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a $\textbf{training-free, plug-and-play module}$, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by $\textbf{4.7\%, 4.9\%, and 8.8\%}$ on VideoMME, LongVideoBench, and MLVU, respectively. Code is provided in the supplementary material and will be released publicly.
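Since the authors' implementation lives in the supplementary material, the following is only a minimal sketch of the pipeline described in the abstract, assuming precomputed DINO frame embeddings and a query embedding in a shared space (e.g., from a CLIP-style text encoder). The helper names, the segmentation threshold `seg_threshold`, and the fixed MMR weight `lam` are illustrative assumptions; in particular, the paper's adaptive MMR weighting is not reproduced here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Row-normalize embeddings so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def segment_by_similarity(frame_emb, threshold=0.85):
    """Split frames into visually homogeneous segments by cutting wherever
    the cosine similarity between consecutive frames drops below threshold."""
    sims = (frame_emb[1:] * frame_emb[:-1]).sum(axis=1)
    boundaries = np.where(sims < threshold)[0] + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(frame_emb)]))
    return list(zip(starts, ends))

def select_keyframes(frame_emb, query_emb, k=16, seg_threshold=0.85, lam=0.7):
    """Event-anchored selection sketch:
    1) segment the video into pseudo-events,
    2) pick the most query-relevant frame per event as an anchor,
    3) fill the remaining budget greedily with MMR (relevance minus
       redundancy w.r.t. already-selected frames)."""
    frame_emb = l2_normalize(np.asarray(frame_emb, dtype=np.float32))
    query_emb = l2_normalize(np.asarray(query_emb, dtype=np.float32))

    relevance = frame_emb @ query_emb  # query-frame cosine similarity
    segments = segment_by_similarity(frame_emb, seg_threshold)

    # One anchor per segment: the frame most similar to the query.
    anchors = [s + int(np.argmax(relevance[s:e])) for s, e in segments]
    selected = sorted(anchors, key=lambda i: -relevance[i])[:k]

    # Greedy MMR refinement over the remaining frame budget.
    candidates = [i for i in range(len(frame_emb)) if i not in set(selected)]
    while len(selected) < k and candidates:
        sel_emb = frame_emb[selected]
        redundancy = (frame_emb[candidates] @ sel_emb.T).max(axis=1)
        mmr = lam * relevance[candidates] - (1 - lam) * redundancy
        selected.append(candidates.pop(int(np.argmax(mmr))))
    return sorted(selected)
```

In this reading, the anchors guarantee event coverage before the MMR stage spends the remaining budget, while the MMR term trades off query relevance against redundancy with frames already chosen; the selected indices would then be passed to the LVLM in place of uniformly sampled frames.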
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6904