Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Published: 06 Oct 2025 · Last Modified: 04 Nov 2025 · MTI-LLM @ NeurIPS 2025 Spotlight · CC BY-ND 4.0
Keywords: Agentic RL; Asynchronous RL; Search Agent
TL;DR: This paper trains expert-level search agents with large-scale RL. Using a 128-turn limit and high-quality synthetic data, the agent learns complex, long-horizon search strategies and achieves superior results on major benchmarks.
Abstract: Recent advances in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among the diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. Reinforcement learning (RL) stands out as a natural choice for learning to use tools. However, existing RL agents still fall short of expert-level Search Intelligence: the ability to resolve ambiguous queries, analyze results, and conduct thorough exploration. Existing approaches also fall short in scalability, efficiency, and data quality. For example, the small turn limits in existing online RL methods (e.g., ≤ 10) restrict the learning of complex strategies. This paper introduces ASearcher, a large-scale RL training project for search agents. Our key contributions include: (1) scalable, fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency; and (2) a prompt-based LLM agent that autonomously synthesizes high-quality, challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based 32B agent achieves substantial improvements, with +22.4 and +15.0 Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extremely long-horizon search, with tool calls exceeding 100 turns and output tokens exceeding 400k during training. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 51.1 on xBench and 58.1 on GAIA, achieving state-of-the-art results.
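To illustrate the asynchronous rollout idea the abstract describes, here is a minimal sketch (not the authors' implementation): episodes with very different lengths run concurrently under a 128-turn cap, so short trajectories never block on 100+-turn ones. The `run_episode`, `train_step`, and `fake_search` names and the stub policy are hypothetical, for illustration only.

```python
import asyncio

MAX_TURNS = 128  # long-horizon cap, vs. the <= 10 turns of prior online RL methods

async def run_episode(query, search_tool, max_turns=MAX_TURNS):
    """Roll out one search trajectory, stopping at an answer or the turn cap."""
    trajectory = []
    for turn in range(max_turns):
        # A real agent would let the LLM pick the next tool call; here a stub tool.
        action = await search_tool(query, turn)
        trajectory.append(action)
        if action.get("final"):
            break
    return trajectory

async def train_step(queries, search_tool):
    # Fully asynchronous rollouts: all episodes are awaited concurrently,
    # so the trainer is not stalled by the longest trajectory in the batch.
    return await asyncio.gather(*(run_episode(q, search_tool) for q in queries))

async def fake_search(query, turn):
    # Hypothetical stand-in tool: pretends the answer is found on the third turn.
    await asyncio.sleep(0)
    return {"query": query, "turn": turn, "final": turn >= 2}

if __name__ == "__main__":
    trajectories = asyncio.run(train_step(["q1", "q2"], fake_search))
    print([len(t) for t in trajectories])
```

In a real system the concurrent rollouts would feed a replay buffer consumed by a separate learner process; the sketch only shows why decoupling episode length from the training step matters when trajectories can exceed 100 tool calls.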
Submission Number: 100