Keywords: Video Browsing, Video Understanding, Video Agent
Abstract: The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, a significant gap remains in processing the web's most dynamic and information-dense modality: video.
In this paper, we first formalize the task of Agentic Video Browsing and introduce **Video-BrowseComp**, a benchmark evaluating open-ended agentic browsing tasks that enforce a mandatory dependency on videos.
We observe that current paradigms struggle to reconcile the scale of open-ended video exploration with the need for fine-grained visual verification. Direct visual inference (e.g., RAG) maximizes perception but incurs prohibitive context costs, while text-centric summarization optimizes efficiency but often misses critical visual details required for accurate grounding.
To address this, we propose **Video-Browser**, a novel agent leveraging *Pyramidal Perception*: it filters with cheap metadata and zooms in with expensive visual perception only when necessary. Experiments demonstrate that our approach achieves a 37.5% relative improvement while reducing token consumption by 58.3% compared to direct visual inference, establishing a foundation for verifiable open-web video research.
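The coarse-to-fine control flow behind Pyramidal Perception can be sketched as follows. This is an illustrative assumption of the mechanism described in the abstract, not the paper's actual implementation: the function names, the metadata fields, and the stubbed visual-verification step are all hypothetical.

```python
# Hypothetical sketch of Pyramidal Perception: filter candidate videos with
# cheap metadata first, and spend expensive visual perception only on the
# survivors. All names and data here are illustrative, not the paper's API.

def metadata_filter(video, query_terms):
    """Cheap stage: keep a video only if its title/description mentions a query term."""
    text = (video["title"] + " " + video["description"]).lower()
    return any(term.lower() in text for term in query_terms)

def visual_verify(video):
    """Expensive stage (stub): the real agent would invoke a vision model here."""
    return {"video_id": video["id"], "verified": True}

def pyramidal_browse(videos, query_terms):
    # Zoom in with visual perception only on metadata-matched candidates.
    candidates = [v for v in videos if metadata_filter(v, query_terms)]
    return [visual_verify(v) for v in candidates]

videos = [
    {"id": 1, "title": "Cooking pasta", "description": "a simple recipe"},
    {"id": 2, "title": "Rocket launch replay", "description": "full liftoff footage"},
]
results = pyramidal_browse(videos, ["rocket", "liftoff"])
print([r["video_id"] for r in results])  # only video 2 reaches the expensive stage
```

The token savings reported in the abstract come from exactly this asymmetry: most candidates are rejected by the cheap textual stage, so the costly visual stage runs on a small fraction of the corpus.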
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: multi-modal agents, agent evaluation
Contribution Types: Approaches for low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 7749