Abstract: Despite their remarkable capabilities, large language models often struggle with tasks requiring complex reasoning and planning. While existing approaches like Chain-of-Thought prompting and tree search techniques show promise, they are limited by their reliance on predefined heuristics and computationally expensive exploration strategies. We propose Policy-Guided Tree Search (PGTS), a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search. Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs.
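To make the exploration loop described in the abstract concrete, here is a minimal, hypothetical sketch of a tree search driven by a node-level action space of expand, branch, backtrack, and terminate. The policy and the LLM step generator below are placeholders (a random choice and a stub string), not the trained PGTS policy from the paper; the linked repository contains the actual implementation.

```python
# Hypothetical sketch of a PGTS-style exploration loop (not the authors' code).
# Action names follow the abstract; the policy and LLM calls are placeholders.
from dataclasses import dataclass, field
from enum import Enum
import random


class Action(Enum):
    EXPAND = 0      # continue the current reasoning path with a new step
    BRANCH = 1      # add an alternative step at the current node
    BACKTRACK = 2   # return to the parent node
    TERMINATE = 3   # stop and return the current path


@dataclass
class Node:
    step: str
    parent: "Node" = None
    children: list = field(default_factory=list)


def llm_next_step(path):
    """Placeholder for querying the LLM for the next reasoning step."""
    return f"step_{len(path)}"


def policy(node):
    """Placeholder policy; PGTS would score actions from learned node features."""
    return random.choice(list(Action))


def pgts_search(question, max_actions=20):
    root = Node(step=question)
    current = root
    for _ in range(max_actions):
        action = policy(current)
        if action == Action.TERMINATE:
            break
        if action == Action.BACKTRACK and current.parent is not None:
            current = current.parent
            continue
        # EXPAND and BRANCH are treated identically in this sketch (both add a
        # child); the paper's distinction between deepening a path and exploring
        # an alternative is simplified away here.
        path, node = [], current
        while node is not None:
            path.append(node.step)
            node = node.parent
        child = Node(step=llm_next_step(path), parent=current)
        current.children.append(child)
        current = child
    # Return the reasoning path from the root to the final node.
    path = []
    while current is not None:
        path.append(current.step)
        current = current.parent
    return list(reversed(path))


if __name__ == "__main__":
    print(pgts_search("What is 12 * 7?"))
```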
Lay Summary: Large language models (LLMs) like ChatGPT can solve complex problems by generating step-by-step reasoning. However, they often waste computational effort by exploring too many unnecessary steps, even for simple tasks—a problem sometimes referred to as "overthinking." This inefficiency limits their practicality, especially in time-sensitive or resource-constrained applications.
To address this, we introduce a new method called Policy-Guided Tree Search (PGTS). PGTS treats the language model as an environment and trains a lightweight decision-making policy to guide which reasoning steps to explore. This policy learns to prioritize the most promising paths, enabling the system to focus its effort where it matters most.
Our approach builds on techniques like reinforcement learning and graph neural networks to learn this guidance policy efficiently. Unlike prior methods that blindly generate many reasoning paths or rely on handcrafted heuristics, PGTS adaptively allocates inference effort in a smarter, learned way.
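As a rough illustration of how such a lightweight guidance policy could be trained, the sketch below uses a generic policy-gradient (REINFORCE) update in PyTorch on a toy rollout. The feature vector, reward, and rollout are placeholder assumptions; the paper's method additionally encodes the partial reasoning tree with a graph neural network and interacts with the LLM as the environment, neither of which is reproduced here.

```python
# Hedged, generic policy-gradient sketch; not the paper's training procedure.
import torch
import torch.nn as nn

NUM_ACTIONS = 4   # expand, branch, backtrack, terminate
FEATURE_DIM = 16  # assumed size of the node/tree feature vector


class GuidancePolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM, 32), nn.ReLU(), nn.Linear(32, NUM_ACTIONS)
        )

    def forward(self, features):
        return torch.distributions.Categorical(logits=self.net(features))


def rollout(policy):
    """Placeholder episode: sample actions on random features; reward 1 if the
    policy terminates (action 3) within the step budget, else 0."""
    log_probs, reward = [], 0.0
    for _ in range(8):
        dist = policy(torch.randn(FEATURE_DIM))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        if action.item() == 3:
            reward = 1.0
            break
    return torch.stack(log_probs), reward


policy = GuidancePolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):
    log_probs, reward = rollout(policy)
    loss = -(log_probs.sum() * reward)  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```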
PGTS improves both the quality and efficiency of language model reasoning, making it faster and more accurate on challenging tasks like logic puzzles and planning problems. Ultimately, this work pushes us closer to language models that can reason more like humans—deliberate, structured, and efficient.
Link To Code: https://github.com/leao1995/llm_reasoning
Primary Area: Deep Learning->Large Language Models
Keywords: LLM Reasoning; Tree Search; Reinforcement Learning
Submission Number: 6556