Keywords: Video Browsing, Video Understanding, Video Agent
Abstract: The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, a significant gap remains in processing the web's most dynamic and information-dense modality: video.
In this paper, we first formalize the task of Agentic Video Browsing and introduce **Video-BrowseComp**, a benchmark evaluating open-ended agentic browsing tasks that enforce a mandatory dependency on videos.
We observe that current paradigms struggle to reconcile the scale of open-ended video exploration with the need for fine-grained visual verification. Direct visual inference (e.g., RAG) maximizes perception but incurs prohibitive context costs, while text-centric summarization optimizes efficiency but often misses critical visual details required for accurate grounding.
To address this, we propose **Video-Browser**, a novel agent leveraging *Pyramidal Perception*: it filters with cheap metadata and zooms in with expensive visual perception only when necessary. Experiments demonstrate that our approach achieves a 37.5% relative improvement while reducing token consumption by 58.3% compared to direct visual inference, establishing a foundation for verifiable open-web video research.
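The coarse-to-fine control flow behind Pyramidal Perception can be sketched as follows. This is an illustrative assumption of the mechanism described in the abstract, not the paper's actual implementation: the function names, the metadata fields, and the stubbed visual-verification step are all hypothetical.

```python
# Hypothetical sketch of Pyramidal Perception: filter candidate videos with
# cheap metadata first, and spend expensive visual perception only on the
# survivors. All names and data here are illustrative, not the paper's API.

def metadata_filter(video, query_terms):
    """Cheap stage: keep a video only if its title/description mentions a query term."""
    text = (video["title"] + " " + video["description"]).lower()
    return any(term.lower() in text for term in query_terms)

def visual_verify(video):
    """Expensive stage (stub): the real agent would invoke a vision model here."""
    return {"video_id": video["id"], "verified": True}

def pyramidal_browse(videos, query_terms):
    # Zoom in with visual perception only on metadata-matched candidates.
    candidates = [v for v in videos if metadata_filter(v, query_terms)]
    return [visual_verify(v) for v in candidates]

videos = [
    {"id": 1, "title": "Cooking pasta", "description": "a simple recipe"},
    {"id": 2, "title": "Rocket launch replay", "description": "full liftoff footage"},
]
results = pyramidal_browse(videos, ["rocket", "liftoff"])
print([r["video_id"] for r in results])  # only video 2 reaches the expensive stage
```

The token savings reported in the abstract come from exactly this asymmetry: most candidates are rejected by the cheap textual stage, so the costly visual stage runs on a small fraction of the corpus.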
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: multi-modal agents, agent evaluation
Contribution Types: Approaches for low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 7749