Keywords: Video Language Understanding, Efficient Multimodal Large Language Models
TL;DR: Training-Free Adaptive Frame Selection to improve efficiency and accuracy of Video Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on image understanding tasks, but video comprehension remains a significant challenge due to the high computational cost of processing long frame sequences and the limited token capacity of the underlying Large Language Models (LLMs). Prior approaches often rely on uniform frame sampling, query-agnostic pruning, or costly training of dedicated compression modules. In this work, we introduce CoSeLECT, a training-free, plug-and-play, query-guided frame selection method that intelligently subsamples video frames for efficient use in MLLMs. CoSeLECT leverages two key signals: temporal redundancy, used to identify clusters of similar frames, and query relevance, which measures each frame's semantic alignment with the input query. By combining these signals through an adaptive frame selection strategy, CoSeLECT selects frames that are both diverse and highly relevant to the query, without requiring any model-specific tuning. Our results on various base MLLMs show that CoSeLECT consistently outperforms state-of-the-art methods, including trained methods such as LongVU, by +3.8\% on MVBench and +0.8\% on VideoMME.
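To make the two signals concrete, here is a minimal sketch of query-guided, redundancy-aware frame selection. It is not the authors' released implementation: the embeddings are assumed to come from a CLIP-style encoder, and all names, the greedy selection loop, and the redundancy threshold are illustrative assumptions.

# Minimal sketch of query-guided adaptive frame selection (hypothetical; not the
# paper's code). Assumes precomputed frame embeddings and a query embedding from
# a CLIP-style encoder; names and thresholds are illustrative only.
import numpy as np

def select_frames(frame_embs: np.ndarray, query_emb: np.ndarray,
                  budget: int, redundancy_thresh: float = 0.9) -> list[int]:
    """Pick up to `budget` frames that are query-relevant and mutually diverse."""
    # Normalize so dot products are cosine similarities.
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)

    # Query relevance: cosine similarity of each frame to the query.
    relevance = frame_embs @ query_emb

    # Temporal redundancy: greedily skip frames too similar to frames already kept.
    selected: list[int] = []
    for idx in np.argsort(-relevance):  # most relevant first
        if all(frame_embs[idx] @ frame_embs[j] < redundancy_thresh for j in selected):
            selected.append(int(idx))
        if len(selected) == budget:
            break
    return sorted(selected)  # restore temporal order for the MLLM

# Toy usage: 100 random "frames" with 512-dim embeddings, keep 8 frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))
query = rng.normal(size=512)
print(select_frames(frames, query, budget=8))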
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12714