Language Models as Implicit Tree Search

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: RL-free preference optimization that is asymptotically equivalent to approximate MCTS, realized by a language model learning a stochastic policy.
Abstract: Despite advancing language model (LM) alignment, direct preference optimization (DPO) falls short in LM reasoning, lacking the gains that reinforcement learning (RL) offers. To address this gap, this work proposes a new RL-free preference optimization method that performs DPO while jointly learning a second LM whose response-generation policy is asymptotically equivalent to AlphaZero-like search, the leading family of algorithms for complex reasoning tasks such as chess and Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by the extra LM still equips DPO with a reasoning procedure technically akin to AlphaZero. Our experiments demonstrate that our method outperforms regular DPO variants in human preference alignment and MCTS-based LMs in mathematical reasoning and planning tasks.
Lay Summary: Imagine teaching an AI to be helpful. It can learn to follow our instructions and understand our preferences well (like a good assistant), but it often struggles with the complex thinking needed for tough puzzles or mathematics. Making it good at both usually requires complicated training methods. We created a "team" of two AIs. The first AI learns to understand and follow human preferences. The second acts like a clever "thinking coach," guiding the first to explore ideas and find smart solutions, much as a master chess AI plans its moves, but without the usual complex training steps. This teamwork lets the AI become both a better listener (understanding our preferences) and a sharper thinker (solving difficult problems), leading to more capable, reliable, and helpful AI while preserving its skill at math, planning, and other complex tasks.
Primary Area: Deep Learning->Large Language Models
Keywords: RL-free preference optimization; LLM-based MCTS; LLM alignment; LLM reasoning
Submission Number: 1157