Search Self-Play: Pushing the Frontier of Agent Capability without Supervision

ICLR 2026 Conference Submission 16289 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Self-Play, Deep Search, LLM, Agent, RLVR
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR depends heavily on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human effort and hinders the scaling of RL, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of synthetic agentic tasks is hard to control, so they rarely provide effective RL training advantages. Toward more effective agentic training, we explore self-play training for deep search agents, in which the learning LLM uses multi-turn search-engine calls and acts simultaneously as a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to answer the generated search queries and predict the correct ground-truth answers. To ensure each proposed search query has an accurate ground truth, we collect all the search results from the proposer's trajectory as external knowledge and then perform retrieval-augmented generation (RAG) to test whether the proposed query can be answered correctly when all necessary search documents are provided. Within our search self-play (SSP) game, the proposer and the solver co-evolve their agentic capabilities via both competition and cooperation. Extensive experiments show that SSP significantly and uniformly improves search agents' performance on various benchmarks without any supervision, under both from-scratch and continuous RL training setups.
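To make the round structure described in the abstract concrete, below is a minimal Python sketch of one SSP round: propose a task, verify it with RAG over the proposer's own retrieved documents, then score the solver. The callables `propose`, `rag_answer`, `solve`, and `match`, the `ProposedTask` container, and the adversarial reward assignment are all illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ProposedTask:
    query: str            # synthetic deep-search question
    answer: str           # proposer's claimed ground-truth answer
    documents: list[str]  # every search result retrieved along the proposer's trajectory


def ssp_round(
    propose: Callable[[], ProposedTask],
    rag_answer: Callable[[str, list[str]], str],
    solve: Callable[[str], str],
    match: Callable[[str, str], bool],
) -> Optional[Tuple[float, float]]:
    """One SSP round: propose a task, verify it via RAG, then let the solver attempt it.

    All four callables are hypothetical wrappers around the same policy LLM:
    `propose` runs the multi-turn proposer trajectory, `rag_answer` answers a
    query with the given documents in context, `solve` answers a query through
    its own search-engine calls, and `match` compares answers (e.g. exact match).
    Returns (proposer_reward, solver_reward), or None if the task is discarded.
    """
    task = propose()

    # RAG verification: the query must be answerable when all documents from the
    # proposer's own trajectory are supplied; otherwise it has no reliable ground
    # truth and the round is discarded (no training signal).
    if not match(rag_answer(task.query, task.documents), task.answer):
        return None

    # One plausible reward scheme (an assumption, not necessarily the paper's):
    # the solver is rewarded for answering correctly, while the proposer is
    # rewarded for posing a verified task the solver still fails, which pushes
    # task difficulty upward as training proceeds.
    solver_correct = match(solve(task.query), task.answer)
    solver_reward = 1.0 if solver_correct else 0.0
    proposer_reward = 1.0 - solver_reward
    return proposer_reward, solver_reward
```

In this reading, the RAG check supplies the cooperative pressure (only answerable tasks earn any reward) while the scoring supplies the competitive pressure between proposer and solver.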
Primary Area: reinforcement learning
Submission Number: 16289