Beyond Static Contexts: Evaluating Video-Language Models with Query-Adaptive Frame Constraints

Published: 01 May 2022, Last Modified: 05 May 2026 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Recent advances in long-video question answering rely on benchmarks that uniformly constrain the visual context, feeding a fixed number of frames per query. We demonstrate that this evaluation paradigm creates a structural mismatch: the temporal evidence needed to resolve queries on datasets like Video-MME-v2 spans a wide spectrum. Simply increasing the global frame limit offers marginal gains and scales poorly. Instead, we formalize a fixed-mean inference setting in which the frame budget varies per question but matches the standard uniform baseline on average. To realize this setting, we design a query-adaptive, training-free router that selects from a predefined set of fixed-budget sampling policies. Because the framework modulates only the capacity of the visual context rather than the temporal grounding itself, it acts as an orthogonal enhancement to existing frame-selection algorithms. Shifting the distribution of visual compute according to query demand provides a scalable pathway to higher accuracy on long-form multimodal reasoning tasks.
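The fixed-mean constraint described in the abstract can be illustrated with a minimal sketch. All names here are hypothetical (the paper's actual router, policy set, and difficulty signal are not specified in this abstract): a toy router maps an estimated evidence-span score to one of a few fixed-budget policies, and the resulting budgets are rescaled so their mean equals the uniform baseline.

```python
# Hypothetical sketch of fixed-mean, query-adaptive frame budgeting.
# BASELINE_BUDGET, POLICIES, and the routing heuristic are illustrative
# assumptions, not the paper's actual configuration.

BASELINE_BUDGET = 32            # frames per query in the uniform baseline
POLICIES = [8, 32, 128]         # predefined fixed-budget sampling policies

def route(evidence_span: float) -> int:
    """Toy router: map an estimated evidence-span score in [0, 1]
    to one of the fixed-budget policies."""
    if evidence_span < 0.33:
        return POLICIES[0]
    if evidence_span < 0.66:
        return POLICIES[1]
    return POLICIES[2]

def allocate(span_scores):
    """Assign per-query budgets, then rescale so the mean budget
    equals the uniform baseline (the fixed-mean constraint)."""
    raw = [route(s) for s in span_scores]
    scale = BASELINE_BUDGET * len(raw) / sum(raw)
    return [max(1, round(b * scale)) for b in raw]

budgets = allocate([0.1, 0.5, 0.9, 0.5])
print(budgets, sum(budgets) / len(budgets))  # mean ≈ BASELINE_BUDGET
```

The rescaling step is what keeps the comparison fair against the uniform baseline: total visual compute is held (approximately) constant while its distribution across queries shifts toward the harder ones.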