Keywords: Web agents, deep research, data synthesis, large language models
TL;DR: A progressive data synthesis pipeline to generate challenging long-horizon agentic data for deep research and web agent training.
Abstract: Web-based, or deep research, agents often tackle complex question-answering tasks by engaging in long-horizon interactions with web tools to extract relevant information. Such long-horizon tasks can be challenging for agents whose underlying language models have not been optimized for them. Previous work has proposed various workflows that use variants of knowledge graphs to construct instruction-tuning (SFT) data for training agents. While sometimes equipped with filtering mechanisms, existing methods often lack fine-grained difficulty and quality control, so the resulting synthetic data may not be sufficiently difficult or long-horizon. We propose a two-pronged agentic data synthesis pipeline in which question-answer pairs are created by iteratively and gradually increasing the complexity and difficulty of the questions until a frontier baseline web agent fails to answer them. During this data creation process, the baseline agent is used to attempt the questions, validate factuality, and check for alternative answers, among other aggressive filtering procedures. In experiments across various web-based benchmarks, we demonstrate that the dataset obtained from our pipeline, despite being smaller, outperforms various existing datasets in training effective web agents. Among other benefits, our dataset contains twice as many diverse tool-call actions as previous datasets, which improves trained models and helps them avoid repetitive tool-calling behavior.
Primary Area: datasets and benchmarks
Submission Number: 21644