Selective Visual Context Routing for Long-Video Understanding Benchmarks

Yuxuan Yuan

Published: 02 Feb 2026, Last Modified: 05 May 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: Long-video understanding models are often constrained by the number of frames they can process at inference time. A common solution is to impose a fixed frame budget on every question, but a uniform budget is poorly matched to long-video queries, which vary widely in the evidence they demand. Some questions depend on a brief event, others require evidence distributed across a long temporal span, and many fall somewhere in between. We analyze this heterogeneity on Video-MME-v2 and observe that different subsets favor different frame budgets and frame-sampling policies. These results suggest that the budget should be treated as a question-dependent decision rather than a constant. We present FrameRouter, a training-free, test-time visual context routing method that assigns each question a suitable frame budget while maintaining a prescribed average budget over the evaluation set. FrameRouter uses lightweight evidence-demand signals and a budget-constrained allocation rule to choose among frozen temporal sampling configurations. Without modifying model weights or introducing new frame selectors, the method improves long-video QA performance on Video-MME-v2 and several other long-video understanding benchmarks. The findings highlight the importance of matching visual context length to query difficulty and evidence distribution.
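
The abstract does not spell out the allocation rule, so the following is only a minimal illustrative sketch of what budget-constrained routing could look like, not the paper's method. It assumes each question has a scalar evidence-demand score and uses a simple greedy rule: every question starts at the smallest frame budget, and the remaining frames are spent on the highest-demand questions first, so the mean never exceeds the prescribed average. The function name `route_budgets`, the demand scores, and the budget set are all hypothetical.

```python
from typing import List, Sequence


def route_budgets(
    demand: Sequence[float],
    budgets: Sequence[int],
    avg_budget: float,
) -> List[int]:
    """Greedy sketch: assign a frame budget per question under a mean cap.

    demand     -- hypothetical evidence-demand score per question (higher = needs more frames)
    budgets    -- candidate frame budgets (assumed set of frozen sampling configurations)
    avg_budget -- prescribed average frame budget over all questions
    """
    levels = sorted(set(budgets))
    n = len(demand)
    if n == 0:
        return []
    if levels[0] > avg_budget:
        raise ValueError("smallest budget already exceeds the prescribed average")

    # Everyone starts at the smallest budget; `remaining` is the spare frame pool.
    assigned = [levels[0]] * n
    remaining = avg_budget * n - levels[0] * n

    # Visit questions from highest to lowest demand.
    order = sorted(range(n), key=lambda i: demand[i], reverse=True)
    for i in order:
        # Pick the largest budget whose extra cost still fits in the pool.
        for level in reversed(levels):
            extra = level - levels[0]
            if extra <= remaining:
                assigned[i] = level
                remaining -= extra
                break
    return assigned


if __name__ == "__main__":
    # Four questions, three candidate budgets, target mean of 16 frames.
    print(route_budgets([0.9, 0.1, 0.5, 0.2], [8, 16, 32], 16))  # [32, 8, 16, 8]
```

In this toy run the highest-demand question receives 32 frames, the next receives 16, and the two low-demand questions stay at 8, which keeps the mean at exactly 16. A real router would need calibrated demand signals and likely a less greedy allocation; this sketch only illustrates the constraint.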