ROVER: Reasoning Over Video with Efficient Retrieval

ACL ARR 2026 January Submission5543 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: video understanding, video reasoning, long video reasoning
Abstract: Answering questions in natural language based on a given video is commonly referred to as the VideoQA task. When videos are long and questions are complex, multi-step reasoning is often required to integrate visual evidence distributed throughout the video. Efficiently sampling relevant visual evidence from long videos under limited computational budgets remains a key challenge. In this paper, we use visual tokens as a measure of sampling cost and propose ROVER (Reasoning Over Video with Efficient Retrieval). ROVER is a tool-augmented framework that first retrieves low-resolution frames containing fewer visual tokens to locate relevant events, and then selectively zooms in by retrieving high-resolution frames with richer visual details. ROVER is trained using a SFT-then-RL recipe, enabling dynamic coordination of low- and high-resolution frame retrieval under a question-dependent visual-token budget. ROVER achieves state-of-the-art performance on 3 out of 4 video reasoning benchmarks, while remaining competitive on four general VideoQA benchmarks. Extensive experiments also empirically show a strong accuracy--efficiency balance.
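The coarse-to-fine retrieval described in the abstract can be sketched as a simple greedy two-stage sampler. This is a hypothetical illustration only: all names (`rover_sample`, `TOKENS_LOW`, `TOKENS_HIGH`, the `relevance` callback) and the per-frame token costs are assumptions, not the authors' actual implementation, which is learned via SFT-then-RL rather than hand-coded.

```python
# Hypothetical sketch of ROVER-style coarse-to-fine frame retrieval under a
# visual-token budget. Token costs per frame are illustrative assumptions.

TOKENS_LOW = 32    # assumed cost of one low-resolution frame
TOKENS_HIGH = 256  # assumed cost of one high-resolution frame

def rover_sample(video_len, budget, relevance):
    """Greedy two-stage sampler: scan cheap low-res frames first to
    localize events, then spend the remaining budget zooming into the
    most relevant timestamps at high resolution."""
    spent = 0
    # Stage 1: uniform low-resolution scan to locate candidate events.
    low_res = []
    step = max(video_len // 8, 1)
    for t in range(0, video_len, step):
        if spent + TOKENS_LOW > budget:
            break
        low_res.append(t)
        spent += TOKENS_LOW
    # Stage 2: zoom into the highest-relevance timestamps.
    high_res = []
    for t in sorted(low_res, key=relevance, reverse=True):
        if spent + TOKENS_HIGH > budget:
            break
        high_res.append(t)
        spent += TOKENS_HIGH
    return low_res, high_res, spent

# Toy usage: relevance peaks at timestamp 400 of an 800-frame video.
low, high, used = rover_sample(video_len=800, budget=800,
                               relevance=lambda t: -abs(t - 400))
```

In ROVER itself the relevance signal and the decision of when to zoom are produced by the model as tool calls; the greedy loop above merely illustrates how a fixed token budget forces a trade between broad low-resolution coverage and a few expensive high-resolution looks.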
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Question Answering
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5543