TL;DR: SoftTreeMax: a novel parametric policy based on tree expansion that can be optimized with standard policy gradient methods.
Abstract: Policy gradient methods are notorious for their large gradient variance and high sample complexity. To mitigate this, we introduce SoftTreeMax, a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion reduces its gradient variance. We prove that the variance depends on the chosen tree-expansion policy: specifically, the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while the same variance reduction is retained. Ours is the first result to bound the gradient bias under an approximate model. In a practical implementation of SoftTreeMax, we use a parallel GPU-based simulator for fast and efficient tree expansion. With this implementation, we show on Atari that SoftTreeMax reduces the gradient variance by three orders of magnitude, leading to better sample complexity and improved performance compared to distributed PPO.
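As a concrete illustration, below is a minimal sketch of how a SoftTreeMax-style policy could be computed for a discrete action space, assuming access to a deterministic forward model `step(s, a) -> (next_state, reward)` and a learned per-state logit function `theta`; these names and the exhaustive-expansion loop are hypothetical illustrations, not the paper's GPU implementation. Each depth-`d` trajectory contributes its discounted cumulative reward plus the discounted logit of its leaf state, and trajectories sharing the same first action are aggregated with log-sum-exp before the final softmax.

```python
import itertools

import numpy as np


def soft_tree_max(s0, n_actions, depth, gamma, step, theta):
    """Sketch of a SoftTreeMax-style policy over the root actions.

    step(s, a) -> (next_state, reward): assumed deterministic forward model.
    theta(s) -> float: learned per-state logit.
    Exhaustively expands all action sequences of length `depth` (>= 1);
    the number of leaves is exponential in depth, which the paper offsets
    with a parallel GPU-based simulator.
    """
    logits = np.full(n_actions, -np.inf)
    for seq in itertools.product(range(n_actions), repeat=depth):
        s, ret = s0, 0.0
        for t, a in enumerate(seq):
            s, r = step(s, a)
            ret += (gamma ** t) * r
        # Discounted cumulative reward, topped with the leaf-state logit.
        score = ret + (gamma ** depth) * theta(s)
        # Aggregate trajectories sharing the same first action (log-sum-exp).
        logits[seq[0]] = np.logaddexp(logits[seq[0]], score)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()  # softmax over the root actions
```

Since the tree has exponentially many leaves in the depth, a practical implementation would expand and score the leaves in batch on an accelerator, as the abstract describes.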
Lay Summary: Imagine you're learning to play a video game where you need to make a sequence of moves to maximize your score. Most AI agents that learn through trial and error suffer from a major problem: their learning process is very noisy and unstable, like trying to navigate in a storm. This makes them slow to improve and requires enormous amounts of practice to get good results.
Our solution, called SoftTreeMax, works like a chess player who thinks several moves ahead before deciding what to do. Instead of considering only immediate consequences, our AI looks at all possible sequences of moves over the next few steps and chooses actions based on this forward-looking analysis. We proved mathematically that this approach reduces learning noise exponentially: the further ahead the agent looks, the more stable its learning becomes. In tests on classic Atari games, our method achieved up to 5 times higher scores than standard approaches while reducing learning noise by a factor of 1,000, making AI training much more efficient and reliable across many applications.
Link To Code: https://github.com/NVlabs/SoftTreeMax
Primary Area: Reinforcement Learning
Keywords: Reinforcement Learning, Policy Gradient, Tree Expansion, Softmax, Variance Reduction
Submission Number: 9694