CoSeLECT: Adaptive Frame Selection for Video-Language Understanding

Published: 29 May 2026, Last Modified: 29 May 2026, Venue: VidLLMs 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Efficient Inference, Video Large Language Models, Frame Selection
TL;DR: Training-free, plug-and-play video frame selection for MLLMs that combines query relevance and visual continuity signals to adaptively pick diverse, informative frames, outperforming both trained and training-free baselines across benchmarks.
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on image understanding tasks, but video comprehension remains challenging because long frame sequences are computationally expensive to process and the underlying Large Language Models (LLMs) have limited token capacity. Prior approaches often rely on uniform frame sampling or query-agnostic pruning, or require costly training of dedicated compression modules. In this work, we introduce CoSeLECT, a training-free, plug-and-play, query-guided frame selection method that subsamples video frames for efficient use in MLLMs. CoSeLECT leverages two key signals: temporal redundancy, which identifies clusters of similar frames, and query relevance, which scores frames by their semantic alignment with the input query. By combining these signals through an adaptive frame selection strategy, CoSeLECT selects frames that are both diverse and highly relevant to the query, without requiring any model-specific tuning. Our results on various base MLLMs show that CoSeLECT consistently outperforms trained and training-free state-of-the-art methods, including LongVU by +3.8% on MLVU and AKS by +4.5% on EgoSchema. A reference implementation is provided in the appendix.
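To make the two signals concrete, the following is a minimal illustrative sketch and not the authors' reference implementation. It assumes frame and query embeddings already live in a shared space (for example, from a CLIP-style encoder). The function name `select_frames`, the `sim_threshold` parameter, the consecutive-frame segmentation, and the round-robin allocation rule are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def _l2_normalize(x, eps=1e-8):
    # Scale vectors to (approximately) unit length so dot products act as cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def select_frames(frame_feats, query_feat, budget, sim_threshold=0.9):
    """Hypothetical sketch of query-guided, redundancy-aware frame selection.

    frame_feats: (n_frames, dim) array of frame embeddings (assumed given).
    query_feat:  (dim,) array, query embedding in the same space (assumed given).
    budget:      maximum number of frames to keep.
    Returns a sorted list of selected frame indices.
    """
    frames = _l2_normalize(np.asarray(frame_feats, dtype=np.float64))
    query = _l2_normalize(np.asarray(query_feat, dtype=np.float64))
    n = frames.shape[0]
    budget = min(budget, n)
    if budget <= 0:
        return []

    # Query relevance: cosine similarity between each frame and the query.
    relevance = frames @ query

    # Temporal redundancy (assumed rule): consecutive frames whose similarity
    # stays above sim_threshold are grouped into one segment.
    consecutive_sim = np.sum(frames[1:] * frames[:-1], axis=1)
    boundaries = np.flatnonzero(consecutive_sim < sim_threshold) + 1
    segments = np.split(np.arange(n), boundaries)

    # Rank frames inside each segment, and rank segments by their best frame.
    ranked = [seg[np.argsort(-relevance[seg], kind="stable")] for seg in segments]
    seg_scores = np.array([relevance[seg].max() for seg in segments])
    seg_order = np.argsort(-seg_scores, kind="stable")

    # Allocation (assumed rule): round-robin over segments, most relevant
    # segment first, so every segment is represented before any gets a
    # second frame. This trades off diversity against relevance.
    selected = []
    rank = 0
    while len(selected) < budget:
        for si in seg_order:
            if rank < len(ranked[si]) and len(selected) < budget:
                selected.append(int(ranked[si][rank]))
        rank += 1
    return sorted(selected)
```

In this sketch, a video with two visually distinct scenes and a budget of two frames yields one frame per scene, each being the most query-relevant frame of its scene; larger budgets then go to the more relevant scene first.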
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 31