Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

Ran Xin; Zeyu Zheng; Yanchen Nie; Kun Yuan; Xia Xiao

Published: 30 Apr 2026, Last Modified: 24 Jun 2026; ICML 2026 regular; CC BY 4.0

TL;DR: BFS-Prover-V2 combines multi-turn off-policy RL with a planner–prover multi-agent search to overcome training plateaus and inference bottlenecks, achieving state-of-the-art results in formal theorem proving.

Abstract: The integration of Large Language Models (LLMs) with automated theorem proving has shown immense promise, yet is constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a step-level theorem proving system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of the LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent system that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the miniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
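To make the idea of tactic-level data filtering concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each collected tactic carries a single hypothetical `difficulty` score (for example, how surprised the current model is by that tactic) and keeps only a middle band, dropping the easiest and the least reliable examples. The names `TacticExample` and `filter_band`, the scoring field, and the cut-off fractions are all illustrative assumptions; the abstract does not specify the filtering criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TacticExample:
    state: str         # proof state before the tactic is applied
    tactic: str        # tactic text produced by the step-prover
    difficulty: float  # hypothetical score; higher means harder or noisier


def filter_band(
    examples: List[TacticExample],
    low_frac: float = 0.2,
    high_frac: float = 0.1,
) -> List[TacticExample]:
    """Keep the middle band of examples ranked by difficulty.

    The lowest `low_frac` of examples are treated as too easy and the
    highest `high_frac` as too unreliable. Both fractions are assumed
    values chosen for illustration only.
    """
    ranked = sorted(examples, key=lambda e: e.difficulty)
    n = len(ranked)
    start = int(n * low_frac)
    stop = n - int(n * high_frac)
    return ranked[start:stop]


if __name__ == "__main__":
    data = [TacticExample(f"state_{i}", f"tactic_{i}", float(i)) for i in range(10)]
    kept = filter_band(data)
    print([e.tactic for e in kept])
```

In an expert iteration loop, a filter of this kind would run on each round's successful proof steps before they are added to the training set, so that the thresholds can be re-derived from the current data distribution every round.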

Lay Summary: Mathematical proofs can be checked by proof checkers, but writing proofs in a form that these systems can verify is still slow and requires significant expertise. This limits the use of formal proof systems, even though they can provide very strong guarantees that a mathematical argument is correct. We introduce BFS-Prover-V2, an AI system that helps construct formal mathematical proofs step by step. Instead of trying to write an entire proof at once, the system proposes one proof step at a time and receives feedback from the proof checker. Yet AI proof systems often hit two walls: during training they stop getting better after a while, and on hard problems they get lost among an enormous number of possible moves. BFS-Prover-V2 addresses both. It improves through repeated practice: it learns from successful proof attempts, filters out examples that are either too easy or too unreliable, and refreshes its training data when progress stalls. For harder problems, it also uses a planning model to break a theorem into smaller goals, while multiple proof-search agents work on those goals and share each discovery so that no effort is wasted. This approach achieves strong results on standard formal mathematics benchmarks and makes automated proof construction more practical. Beyond mathematics, the same ideas may help AI systems tackle other tasks that require long, reliable chains of reasoning.
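The way parallel prover agents share discoveries can be sketched with a small thread-safe cache. This is an illustrative sketch under stated assumptions, not the system's actual architecture: `ProofCache`, `prove_subgoals`, and the `try_prove` callback are hypothetical names, subgoals are represented as plain strings, and a real prover would call a proof checker instead of the toy function shown here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class ProofCache:
    """Thread-safe map from subgoal statement to a proof found for it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._proofs: Dict[str, str] = {}

    def get(self, subgoal: str) -> Optional[str]:
        with self._lock:
            return self._proofs.get(subgoal)

    def put(self, subgoal: str, proof: str) -> None:
        with self._lock:
            # Keep the first proof recorded for a subgoal.
            self._proofs.setdefault(subgoal, proof)


def prove_subgoals(
    subgoals: List[str],
    try_prove: Callable[[str], Optional[str]],
    cache: ProofCache,
    workers: int = 4,
) -> Dict[str, Optional[str]]:
    """Attempt each planner-supplied subgoal in parallel, reusing cached proofs."""

    def work(goal: str) -> Optional[str]:
        cached = cache.get(goal)
        if cached is not None:
            return cached          # another agent already proved this subgoal
        proof = try_prove(goal)    # hypothetical prover call; None means failure
        if proof is not None:
            cache.put(goal, proof)
        return proof

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(work, subgoals))
    return dict(zip(subgoals, results))


if __name__ == "__main__":
    cache = ProofCache()
    toy_prover = lambda g: None if "hard" in g else "by simp"
    print(prove_subgoals(["a = a", "hard lemma"], toy_prover, cache))
```

A subgoal that fails (returns `None`) is not cached, so a planner could later replace it with a finer decomposition while the proofs already found for its sibling subgoals remain available to every agent.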

Link To Code: https://github.com/ByteDance-Seed/BFS-Prover-V2

Primary Area: Deep Learning->Large Language Models

Keywords: Large Language Models; Expert Iteration; Tree Search; Theorem Proving; Formal Mathematics

Submission Number: 25818
