SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Expansion

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Reinforcement Learning, Policy Gradient, Softmax, Tree expansion
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We introduce SoftTreeMax, a novel parametric policy that integrates tree expansion into policy gradient. We analyze its variance and bias, and implement a deep-RL version of it.
Abstract: Policy gradient methods suffer from large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax---a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze the gradient variance of SoftTreeMax and reveal for the first time how tree expansion helps reduce this variance. We prove that the variance decays exponentially with the planning horizon as a function of the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the faster the decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance decay. Ours is the first result to bound the gradient bias with an approximate model. In a practical implementation of SoftTreeMax, we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.
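To make the construction concrete, below is a minimal tabular sketch of one way to read the abstract's description: each root action's logit is its expected multi-step discounted cumulative reward under a (here, uniform) tree-expansion policy, plus the discounted parametric logits theta at the leaf states. All names (`tree_value`, `softtreemax_policy`), the toy MDP, and the uniform expansion policy are illustrative assumptions, not the paper's GPU-based implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (sizes and values are illustrative, not from the paper).
n_states, n_actions = 5, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.uniform(size=(n_states, n_actions))                       # R[s, a] = immediate reward
theta = rng.normal(size=n_states)                                 # parametric leaf logits
gamma, beta, depth = 0.99, 1.0, 3


def tree_value(s: int, d: int) -> float:
    """Expected discounted return of a depth-d expansion from state s,
    following a uniform expansion policy, with theta(s) as the leaf logit."""
    if d == 0:
        return float(theta[s])
    # Uniform expansion: average over actions of the reward plus the
    # expected value of the remaining (d-1)-step subtree.
    q = [R[s, a] + gamma * P[s, a] @ [tree_value(s2, d - 1) for s2 in range(n_states)]
         for a in range(n_actions)]
    return float(np.mean(q))


def softtreemax_policy(s: int, d: int) -> np.ndarray:
    """SoftTreeMax-style policy at state s: softmax over per-action logits
    built from the d-step cumulative reward plus discounted leaf logits."""
    logits = np.array([
        R[s, a] + gamma * P[s, a] @ [tree_value(s2, d - 1) for s2 in range(n_states)]
        for a in range(n_actions)
    ])
    z = np.exp(beta * (logits - logits.max()))  # numerically stabilized softmax
    return z / z.sum()


print(softtreemax_policy(0, depth))  # action probabilities at state 0
```

Note that this exhaustive recursion grows exponentially with depth; per the abstract, the paper's implementation instead performs the tree expansion with a parallel GPU-based simulator.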
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5083