WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent

ICLR 2026 Conference Submission 7834 Authors

16 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Agent, Web Agent, Deep Research, Visual Question Answering (VQA), Tool-augmented Reasoning, Multimodal Information-Seeking Benchmark
TL;DR: We present WebWatcher, a multimodal web agent that learns from synthetic trajectories and reinforcement learning to achieve state-of-the-art performance in complex information-seeking tasks requiring joint visual and textual reasoning.
Abstract: Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains largely text-centric, overlooking visual information in the real world. This makes multimodal deep research highly challenging, as such agents require much stronger perceptual, logical, and knowledge-based reasoning abilities, as well as proficiency with more sophisticated tools. To address this limitation, we introduce WebWatcher, a multimodal agent for deep research with enhanced vision-language reasoning capabilities. It uses high-quality synthetic trajectories for efficient cold-start training, leverages various tools for deep reasoning, and further improves generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a benchmark in the style of BrowseComp that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher outperforms or matches proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.
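
Below is a minimal, hedged sketch of the kind of tool-augmented agent rollout the abstract describes (tools for deep reasoning, with trajectories reusable for cold-start training or reinforcement learning). The tool names (`web_search`, `image_search`), the `Step`/`Trajectory` structures, and the policy interface are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a tool-augmented agent rollout loop (hypothetical tools and policy
# interface; a real policy would be the vision-language model itself).
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    thought: str
    action: str          # tool name, or "answer" to terminate the episode
    argument: str
    observation: str = ""

@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)

def web_search(query: str) -> str:
    """Stub text-search tool; a real agent would call a search API here."""
    return f"[search results for: {query}]"

def image_search(query: str) -> str:
    """Stub visual-search tool; a real agent might do reverse image search."""
    return f"[image results for: {query}]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "image_search": image_search,
}

def run_agent(question: str,
              policy: Callable[[Trajectory], Step],
              max_steps: int = 8) -> Trajectory:
    """Roll out the policy until it answers or exhausts the step budget.

    The resulting trajectory can be stored as a synthetic training example
    (cold-start supervised fine-tuning) or scored with a reward for RL.
    """
    traj = Trajectory(question)
    for _ in range(max_steps):
        step = policy(traj)
        if step.action == "answer":
            traj.steps.append(step)
            break
        tool = TOOLS.get(step.action)
        step.observation = tool(step.argument) if tool else f"unknown tool: {step.action}"
        traj.steps.append(step)
    return traj

if __name__ == "__main__":
    # Toy policy: one search, then answer.
    def toy_policy(traj: Trajectory) -> Step:
        if not traj.steps:
            return Step("Need context for the question.", "web_search", traj.question)
        return Step("Enough evidence gathered.", "answer", "final answer")

    print(run_agent("Which landmark appears in the photo?", toy_policy))
```

In this framing, cold-start training would imitate high-quality trajectories of this form, while the RL stage would optimize the policy against a reward on the final answer.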
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7834