Keywords: Web Agent Foundation Models, Automatic Data Scaling
Abstract: Deep research web agents must not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge in order to generate high-quality, insightful research.
However, existing open-source deep research agent systems predominantly focus on enhancing the *information seeking* capabilities of web agents to *locate* specific information, while overlooking the essential need for *information aggregation*, which limits their ability to generate coherent insights or support in-depth research.
In this paper, we propose a paradigm for scalably constructing verifiable training datasets for web agents, by framing data construction as an agentic task grounded in real web pages while placing additional focus on developing fine-grained rules that enable complex information aggregation.
Our approach synthesizes tasks by first collecting information through *proactive online web exploring* in the real web environment,
followed by *Complex Aggregation Logic Injection* to compose verifiable question-answer pairs from the aggregated knowledge snippets, covering over 12 logical operations.
The resulting dataset, WebAggregatorQA, contains about 10K samples across 50K websites and covers more than 11 domains.
Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, named WebAggregator.
WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches the performance of Claude-3.7-sonnet.
Moreover, given the limited availability of benchmarks that evaluate web agents’ information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet achieves only 28% and GPT-4.1 scores 25.8%. Even after retrieving all of the references, both models still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundation models.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14932