WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent

ICLR 2026 Conference Submission 7834 Authors

16 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Agent, Web Agent, Deep Research, Visual Question Answering (VQA), Tool-augmented Reasoning, Multimodal Information-Seeking Benchmark
TL;DR: We present WebWatcher, a multimodal web agent trained on synthetic trajectories and refined with reinforcement learning, achieving state-of-the-art performance on complex information-seeking tasks that require joint visual and textual reasoning.
Abstract: Web agents such as deep research agents have demonstrated superhuman cognitive abilities, solving highly challenging information-seeking problems. However, most existing research remains text-centric and overlooks visual information in the real world. This makes multimodal deep research highly challenging, as such agents require much stronger perceptual, logical, and knowledge-based reasoning abilities, as well as proficiency with more sophisticated tools. To address this limitation, we introduce WebWatcher, a multimodal deep research agent with joint reasoning abilities across visual and textual modalities. It uses high-quality synthetic trajectories for efficient cold-start training, leverages various tools for deep reasoning, and further improves generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher outperforms prompt-based workflows and open-source agents on HLE and BrowseComp-VL, and demonstrates strong perception, multimodal reasoning, and search capabilities on three additional benchmarks.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7834