ROVER: Reasoning Over Video with Efficient Retrieval

ACL ARR 2026 January Submission5543 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: video understanding, video reasoning, long video reasoning
Abstract: Answering questions in natural language based on a given video is commonly referred to as the VideoQA task. When videos are long and questions are complex, multi-step reasoning is often required to integrate visual evidence distributed throughout the video. Efficiently sampling relevant visual evidence from long videos under limited computational budgets remains a key challenge. In this paper, we use visual tokens as a measure of sampling cost and propose ROVER (Reasoning Over Video with Efficient Retrieval). ROVER is a tool-augmented framework that first retrieves low-resolution frames containing fewer visual tokens to locate relevant events, and then selectively zooms in by retrieving high-resolution frames with richer visual details. ROVER is trained using a SFT-then-RL recipe, enabling dynamic coordination of low- and high-resolution frame retrieval under a question-dependent visual-token budget. ROVER achieves state-of-the-art performance on 3 out of 4 video reasoning benchmarks, while remaining competitive on four general VideoQA benchmarks. Extensive experiments also empirically show a strong accuracy--efficiency balance.
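The coarse-to-fine retrieval described in the abstract can be sketched as a simple greedy two-stage sampler. This is a hypothetical illustration only: all names (`rover_sample`, `TOKENS_LOW`, `TOKENS_HIGH`, the `relevance` callback) and the per-frame token costs are assumptions, not the authors' actual implementation, which is learned via SFT-then-RL rather than hand-coded.

```python
# Hypothetical sketch of ROVER-style coarse-to-fine frame retrieval under a
# visual-token budget. Token costs per frame are illustrative assumptions.

TOKENS_LOW = 32    # assumed cost of one low-resolution frame
TOKENS_HIGH = 256  # assumed cost of one high-resolution frame

def rover_sample(video_len, budget, relevance):
    """Greedy two-stage sampler: scan cheap low-res frames first to
    localize events, then spend the remaining budget zooming into the
    most relevant timestamps at high resolution."""
    spent = 0
    # Stage 1: uniform low-resolution scan to locate candidate events.
    low_res = []
    step = max(video_len // 8, 1)
    for t in range(0, video_len, step):
        if spent + TOKENS_LOW > budget:
            break
        low_res.append(t)
        spent += TOKENS_LOW
    # Stage 2: zoom into the highest-relevance timestamps.
    high_res = []
    for t in sorted(low_res, key=relevance, reverse=True):
        if spent + TOKENS_HIGH > budget:
            break
        high_res.append(t)
        spent += TOKENS_HIGH
    return low_res, high_res, spent

# Toy usage: relevance peaks at timestamp 400 of an 800-frame video.
low, high, used = rover_sample(video_len=800, budget=800,
                               relevance=lambda t: -abs(t - 400))
```

In ROVER itself the relevance signal and the decision of when to zoom are produced by the model as tool calls; the greedy loop above merely illustrates how a fixed token budget forces a trade between broad low-resolution coverage and a few expensive high-resolution looks.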
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Question Answering
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5543