Zero-Shot Visual Budgeting for Temporal Multimodal Benchmarks

Hang Yu

Published: 08 May 2024, Last Modified: 05 May 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: Standard protocols for evaluating long-video understanding enforce a static temporal limit, passing the same frame budget to the vision-language model for every test instance. Our analysis of Video-MME-v2 reveals that this one-size-fits-all constraint distributes computational resources suboptimally, because the amount of visual evidence each query requires varies starkly. We introduce a training-free test-time framework that dynamically reallocates a fixed-mean frame budget during inference. By estimating the evidence demand of each incoming question, the system routes queries among a discrete set of frozen frame-sampling capacities. The allocation is constrained so that the global mean matches standard uniform configurations, allowing direct comparison without increasing overall overhead. This FrameRouter layer operates independently of the underlying frame selection heuristics. Extensive testing across Video-MME-v2 and several long-video understanding benchmark suites shows that calibrating visual context size to query difficulty yields consistent empirical improvements.
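
To make the fixed-mean constraint concrete, the following is a minimal sketch of one way such routing could work. It is not the paper's FrameRouter algorithm. The function name `route_frame_budgets`, the demand scores, the tier values, and the greedy upgrade rule are all illustrative assumptions; the abstract does not specify how demand is estimated or how the allocation is solved.

```python
import heapq
from typing import List, Sequence


def route_frame_budgets(
    demand: Sequence[float],
    tiers: Sequence[int],
    mean_budget: float,
) -> List[int]:
    """Assign one frame-count tier per query without exceeding the mean budget.

    Hypothetical greedy rule (an assumption, not the paper's method): every
    query starts at the smallest tier, then the upgrade with the highest
    demand per extra frame is applied while the total allowance permits.
    """
    tiers = sorted(set(tiers))
    n = len(demand)
    if n == 0:
        return []
    if tiers[0] > mean_budget:
        raise ValueError("smallest tier already exceeds the mean budget")

    level = [0] * n                              # tier index per query
    remaining = mean_budget * n - tiers[0] * n   # frames left to hand out

    def push(heap: list, i: int) -> None:
        # Queue the next possible upgrade for query i, if one exists.
        if level[i] + 1 < len(tiers):
            cost = tiers[level[i] + 1] - tiers[level[i]]
            heapq.heappush(heap, (-demand[i] / cost, i))

    heap: list = []
    for i in range(n):
        push(heap, i)

    while heap:
        _, i = heapq.heappop(heap)
        cost = tiers[level[i] + 1] - tiers[level[i]]
        if cost <= remaining:                    # unaffordable upgrades are dropped
            level[i] += 1
            remaining -= cost
            push(heap, i)

    return [tiers[l] for l in level]


# Four queries with made-up demand scores, compared against a uniform 16-frame baseline.
budgets = route_frame_budgets([0.9, 0.1, 0.5, 0.2], [8, 16, 32, 64], mean_budget=16)
print(budgets)  # [32, 8, 16, 8]
```

In this example the allocation averages exactly 16 frames per query, so it is directly comparable to a uniform 16-frame configuration. In general this sketch only guarantees that the mean is not exceeded; whether it is matched exactly depends on the tier spacing and the number of queries.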