Evidence Budgeting for Efficient Long-Form Video Question Answering

Published: 21 Jun 2023 · Last Modified: 05 May 2026 · OpenReview Archive Direct Upload · Everyone · CC BY 4.0
Abstract: Long-form video question answering systems are commonly evaluated with a uniform frame budget for every question, regardless of each query's actual evidence demands. This practice overlooks a simple but important fact: some questions can be answered from sparse temporal cues, while others require substantially broader visual coverage. In this work, we study this imbalance on Video-MME-v2 and related long-video benchmarks, showing that no single frozen frame-sampling budget is consistently optimal across question types, temporal ranges, and reasoning categories. Motivated by these observations, we propose a training-free adaptive evidence budgeting framework for test-time inference. The method estimates the visual evidence demand of each question, assigns frame budgets under a fixed-mean constraint, and dispatches each example to one of several pre-defined frame-sampling policies. Because the underlying video encoder, language model, and sampling strategies remain unchanged, the FrameRouter framework can be added to existing long-video QA pipelines without retraining. Experiments on Video-MME-v2 and several long-video understanding benchmarks demonstrate that allocating the same mean number of frames more selectively across questions improves accuracy and robustness compared with uniform-budget baselines, especially for mixed-duration and multi-evidence queries.
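The fixed-mean allocation and policy dispatch described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`allocate_budgets`, `route`), the demand scores, the clipping bounds, and the candidate policy sizes are all assumptions made for the example.

```python
import numpy as np

def allocate_budgets(demand, mean_budget, min_frames=4, max_frames=256):
    """Scale raw per-question demand scores so the average frame budget
    across the batch equals the fixed mean, then clip to feasible bounds.
    (Hypothetical sketch; the paper's demand estimator is not specified here.)"""
    demand = np.asarray(demand, dtype=float)
    raw = demand / demand.mean() * mean_budget  # enforce fixed-mean constraint
    return np.clip(np.round(raw), min_frames, max_frames).astype(int)

def route(budgets, policies=(8, 16, 32, 64, 128)):
    """Dispatch each example to the pre-defined sampling policy whose
    frame count is closest to its allocated budget."""
    policies = np.asarray(policies)
    idx = np.abs(policies[:, None] - np.asarray(budgets)[None, :]).argmin(axis=0)
    return policies[idx]

# Example: four questions with unequal evidence demands, mean budget of 32 frames.
budgets = allocate_budgets([0.2, 0.5, 1.0, 2.3], mean_budget=32)
print(budgets)         # per-question budgets averaging 32
print(route(budgets))  # nearest pre-defined sampling policy for each question
```

Clipping can perturb the mean slightly at the extremes; a stricter variant would redistribute the clipped surplus across the remaining questions.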