Fixed-Average Frame Scheduling for Query-Adaptive Long Video Reasoning
Abstract: The effectiveness of long-video reasoning systems depends not only on which frames are sampled, but also on how many frames are made available for each query. Standard long-video question answering evaluations typically use a uniform frame count, implicitly assuming that all questions require the same amount of visual evidence. We challenge this assumption through an empirical study of Video-MME-v2, where we find substantial variation in preferred frame budgets across question groups and reasoning types. Building on this analysis, we propose a fixed-average frame scheduling strategy for long-video inference. Given a target mean frame budget, the scheduler assigns different per-question budgets according to estimated evidence demand and routes examples to a frozen bank of sampling policies. The method is training-free, model-agnostic, and preserves the average computational footprint of the corresponding uniform-budget baseline. Across Video-MME-v2 and other long-video understanding benchmarks, fixed-average scheduling improves question answering accuracy without increasing mean visual input cost. These results suggest that adaptive budget control is a useful complement to existing advances in temporal sampling and multimodal reasoning.
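The budget-conservation constraint described in the abstract can be illustrated with a minimal sketch. It is not the authors' scheduler: the abstract does not specify how evidence demand is estimated or how examples are routed to the bank of sampling policies, so the function name `schedule_budgets`, the `demand` scores, and the `min_frames` floor are all hypothetical. The sketch only shows one way to hand out unequal per-question frame counts whose mean equals the uniform baseline exactly, using proportional allocation with largest-remainder rounding.

```python
from fractions import Fraction
from math import floor
from typing import List


def schedule_budgets(demand: List[float], mean_budget: int,
                     min_frames: int = 1) -> List[int]:
    """Assign integer per-question frame budgets whose mean is mean_budget.

    Hypothetical illustration: `demand` holds non-negative evidence-demand
    scores produced by some external estimator (not defined here).
    """
    n = len(demand)
    if n == 0:
        return []
    if min_frames > mean_budget:
        raise ValueError("min_frames cannot exceed mean_budget")
    if any(d < 0 for d in demand):
        raise ValueError("demand scores must be non-negative")

    total = n * mean_budget          # same total cost as the uniform baseline
    spare = total - n * min_frames   # frames left after giving everyone the floor

    # Exact rational arithmetic avoids floating-point drift in the shares.
    weights = [Fraction(d) for d in demand]
    weight_sum = sum(weights)
    if weight_sum == 0:
        shares = [Fraction(spare, n)] * n           # no signal: split evenly
    else:
        shares = [spare * w / weight_sum for w in weights]

    budgets = [min_frames + floor(s) for s in shares]

    # Largest-remainder rounding: give the frames lost to flooring to the
    # questions with the largest fractional parts, so the total is exact.
    leftover = total - sum(budgets)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - floor(shares[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    budgets = schedule_budgets([1, 1, 2, 4], mean_budget=32, min_frames=8)
    print(budgets, sum(budgets) / len(budgets))
```

With demand scores 1, 1, 2, 4 and a target mean of 32 frames, the four questions receive 20, 20, 32, and 56 frames, which total 128 frames, the same as a uniform 32-frame baseline. A per-question upper cap is deliberately left out; adding one would require redistributing the clipped frames to keep the mean fixed.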