Question-Conditioned Frame Allocation under Fixed Visual Budgets for Long Video QA

Luyao Tang, Zheyuan Cai

Published: 03 May 2024, Last Modified: 05 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Recent long-video understanding benchmarks expose a tension between computational efficiency and evidence coverage. Most evaluation protocols allocate an identical number of frames to every question, even though the necessary visual context varies widely across queries. We revisit this assumption in long video question answering and find that increasing the number of sampled frames does not uniformly benefit all question categories; in several cases, smaller or differently sampled budgets yield stronger performance. To address this issue, we introduce a question-conditioned frame allocation approach that operates at inference time under a fixed mean frame budget. The approach first predicts the likely evidence requirement of a query, then selects a budget-compatible sampling policy from a frozen set of candidates while preserving the same average visual cost as a standard uniform setting. This design separates budget assignment from frame selection, making it complementary to existing query-aware or saliency-based sampling methods. Evaluations on Video-MME-v2 and additional long-video QA datasets show that budget routing provides consistent gains over fixed-budget inference, indicating that selective visual spending is a practical path toward stronger and more efficient long-video reasoning.
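
The two-stage idea in the abstract (estimate how much evidence a question needs, then assign a frame budget while holding the mean cost fixed) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a batch of questions with precomputed evidence-demand scores, only two candidate budgets (`low` and `high`), and plain uniform sampling as the downstream frame selector. The function names `allocate_budgets` and `uniform_frame_indices`, the scores, and the budget values are all hypothetical.

```python
from typing import List, Sequence


def allocate_budgets(scores: Sequence[float], low: int, high: int,
                     mean_budget: float) -> List[int]:
    """Give `high` frames to the highest-scoring questions and `low` to the rest,
    keeping the average budget at or below `mean_budget` (hypothetical router)."""
    if low >= high or not (low <= mean_budget <= high):
        raise ValueError("need low < high and low <= mean_budget <= high")
    n = len(scores)
    # Largest count of high-budget questions that keeps the mean within budget;
    # the small epsilon guards against floating-point rounding just below an integer.
    n_high = int(n * (mean_budget - low) / (high - low) + 1e-9)
    # Question indices sorted by descending predicted evidence demand.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    budgets = [low] * n
    for i in ranked[:n_high]:
        budgets[i] = high
    return budgets


def uniform_frame_indices(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` evenly spaced frame indices (segment midpoints).
    Stands in for any frame selector; budget assignment is independent of it."""
    budget = min(budget, num_frames)
    return [int((k + 0.5) * num_frames / budget) for k in range(budget)]


if __name__ == "__main__":
    # Hypothetical evidence-demand scores for six questions.
    scores = [0.9, 0.1, 0.4, 0.8, 0.2, 0.3]
    budgets = allocate_budgets(scores, low=8, high=32, mean_budget=16)
    print(budgets, sum(budgets) / len(budgets))
    print(uniform_frame_indices(100, 4))
```

Because the budget is assigned before any frames are chosen, `uniform_frame_indices` could be swapped for a query-aware or saliency-based selector without changing the routing step, which mirrors the separation of budget assignment from frame selection described in the abstract.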