Keywords: Multimodal Agent, Web Agent, Deep Research, Visual Question Answering (VQA), Tool-augmented Reasoning, Multimodal Information-Seeking Benchmark
TL;DR: We present WebWatcher, a multimodal web agent trained on synthetic trajectories and refined with reinforcement learning, achieving state-of-the-art performance on complex information-seeking tasks that require joint visual and textual reasoning.
Abstract: Web agents such as Deep Research have demonstrated superhuman cognitive abilities, solving highly challenging information-seeking problems. However, most research remains largely text-centric and overlooks visual information in the real world. This makes multimodal deep research highly challenging, as such agents require much stronger perceptual, logical, and knowledge-based reasoning abilities, as well as proficiency with more sophisticated tools. To address this limitation, we introduce WebWatcher, a multimodal agent for deep research with joint reasoning ability across visual and textual modalities. It uses high-quality synthetic trajectories for efficient cold-start training, leverages various tools for deep reasoning, and further improves generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a benchmark in the style of BrowseComp that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher outperforms prompt-based workflows and open-source agents on HLE and BrowseComp-VL, and demonstrates strong perception, multimodal reasoning, and search capabilities on three additional benchmarks.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7834