Keywords: Data Construction, Monte Carlo Tree Search, Post Training, Reinforcement Learning, Question Answering
Abstract: Recent advances in reinforcement learning (RL) have enabled large language models (LLMs) to perform multi-turn chain-of-thought (CoT) reasoning with tool use, where web search serves as the most critical tool for answering complex questions. However, most existing methods apply RL directly to off-the-shelf models without a supervised fine-tuning (SFT) cold start, resulting in unstable training and limited tool invocations. This difficulty is compounded by the cost of curating long reasoning trajectories, which require extensive annotation and are prone to factual drift. We propose MSearcher, a search agent trained in two stages that combines reflective thinking with robust tool use for complex reasoning. A central contribution is an efficient data construction framework based on Monte Carlo Tree Search (MCTS), which produces self-reflective reasoning trajectories for the SFT cold start. The framework leverages both correct and flawed rollouts to generate natural and diverse reasoning data. We adopt a two-stage pipeline: first applying SFT on our constructed data, then further training the model with RL. This yields substantial improvements on multi-hop question answering: 67.6% on HotpotQA and 52.0% on Frames. These results highlight the importance of a high-quality SFT cold start in stabilizing RL and equipping LLMs with robust long-horizon reasoning capabilities.
Primary Area: reinforcement learning
Submission Number: 25063
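To make the abstract's MCTS-based data construction concrete, here is a minimal sketch of how such a pipeline could collect both correct and flawed rollouts for self-reflective SFT data. This is an illustrative approximation, not the paper's implementation: `propose_steps` (a policy model proposing candidate reasoning or search steps for a partial trajectory), `judge` (a scorer comparing a finished trajectory's answer to the gold label), the `ANSWER:` terminal marker, and the correct/flawed bucketing are all assumptions.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    step: str                         # one reasoning / tool-use step
    parent: "Node | None" = None
    children: "list[Node]" = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    # Standard UCB1: exploit mean value, explore under-visited children.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits
    )

def select(node: Node) -> Node:
    # Selection: descend to a leaf, always taking the highest-UCB child.
    while node.children:
        node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
    return node

def trajectory(node: Node) -> list[str]:
    # Path of steps from the root question down to this node.
    steps = []
    while node is not None:
        steps.append(node.step)
        node = node.parent
    return list(reversed(steps))

def expand(node: Node, propose_steps) -> Node:
    # Expansion: ask the policy model for candidate next steps.
    for step in propose_steps(trajectory(node)):
        node.children.append(Node(step=step, parent=node))
    return random.choice(node.children) if node.children else node

def simulate(traj: list[str], propose_steps, max_depth: int = 8) -> list[str]:
    # Simulation: random continuation until an answer step or the depth cap.
    while len(traj) < max_depth and not traj[-1].startswith("ANSWER:"):
        candidates = propose_steps(traj)
        if not candidates:
            break
        traj = traj + [random.choice(candidates)]
    return traj

def backprop(node: Node, reward: float) -> None:
    # Backpropagation: credit the reward along the selected path.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def collect_trajectories(question: str, propose_steps, judge, n_sim: int = 64):
    """Run n_sim MCTS iterations; bucket finished rollouts by correctness.

    Flawed rollouts are kept deliberately: paired with correct ones for the
    same question, they can seed self-reflective SFT examples (assumption).
    """
    root = Node(step=question)
    correct, flawed = [], []
    for _ in range(n_sim):
        leaf = expand(select(root), propose_steps)
        traj = simulate(trajectory(leaf), propose_steps)
        reward = judge(traj)          # e.g. 1.0 if the final answer matches gold
        (correct if reward > 0.5 else flawed).append(traj)
        backprop(leaf, reward)
    return correct, flawed
```

One plausible use of the two buckets, consistent with the abstract's framing: stitch a flawed rollout and a correct rollout for the same question into a single trajectory in which the model makes a mistake, reflects on it, and recovers, yielding the "natural and diverse" self-reflective data used for the SFT cold start.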