Keywords: Video Language Understanding, Efficient Multimodal Large Language Models
TL;DR: Training-Free Adaptive Frame Selection to improve efficiency and accuracy of Video Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on image understanding tasks, but video comprehension remains a significant challenge due to the high computational cost of processing long frame sequences and the limited token capacity of the underlying Large Language Models (LLMs). Prior approaches often rely on uniform frame sampling, query-agnostic pruning, or costly training of dedicated compression modules. In this work, we introduce CoSeLECT, a training-free, plug-and-play, query-guided frame selection method that intelligently subsamples video frames for efficient use in MLLMs. CoSeLECT leverages two key signals: temporal redundancy, used to identify clusters of similar frames, and query relevance, which measures each frame's semantic alignment with the input query. By combining these signals through an adaptive frame selection strategy, CoSeLECT selects frames that are both diverse and highly relevant to the query, without requiring any model-specific tuning. Our results on various base MLLMs show that CoSeLECT consistently outperforms state-of-the-art methods, including trained methods such as LongVU, by +3.8\% on MVBench and +0.8\% on VideoMME.
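To make the two signals concrete, here is a minimal sketch of query-guided, redundancy-aware frame selection. It is not the authors' released implementation: the embeddings are assumed to come from a CLIP-style encoder, and all names, the greedy selection loop, and the redundancy threshold are illustrative assumptions.

# Minimal sketch of query-guided adaptive frame selection (hypothetical; not the
# paper's code). Assumes precomputed frame embeddings and a query embedding from
# a CLIP-style encoder; names and thresholds are illustrative only.
import numpy as np

def select_frames(frame_embs: np.ndarray, query_emb: np.ndarray,
                  budget: int, redundancy_thresh: float = 0.9) -> list[int]:
    """Pick up to `budget` frames that are query-relevant and mutually diverse."""
    # Normalize so dot products are cosine similarities.
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)

    # Query relevance: cosine similarity of each frame to the query.
    relevance = frame_embs @ query_emb

    # Temporal redundancy: greedily skip frames too similar to frames already kept.
    selected: list[int] = []
    for idx in np.argsort(-relevance):  # most relevant first
        if all(frame_embs[idx] @ frame_embs[j] < redundancy_thresh for j in selected):
            selected.append(int(idx))
        if len(selected) == budget:
            break
    return sorted(selected)  # restore temporal order for the MLLM

# Toy usage: 100 random "frames" with 512-dim embeddings, keep 8 frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))
query = rng.normal(size=512)
print(select_frames(frames, query, budget=8))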
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12714