Policy Gradient with Tree Search (PGTS) in Reinforcement Learning Evades Local Maxima

Published: 19 Mar 2024, Last Modified: 31 May 2024 · Tiny Papers @ ICLR 2024 · CC BY 4.0
Keywords: Policy gradient method, Tree Search, Markov Decision Processes, Reinforcement Learning
TL;DR: We show that combining tree search with policy gradient methods helps find better solutions, i.e., avoid sub-optimal local maxima.
Abstract: Policy gradient (PG) methods are used extensively in practice. However, their theoretical convergence guarantees require strict regularity conditions that are unnatural and generally not satisfied in practice, causing these methods to get stuck at sub-optimal local maxima of the reward. Tree search (TS) methods have recently been shown to enjoy strong empirical performance on related planning tasks. In this work, we attempt a first theoretical analysis of tree-search-based policy gradient and its convergence properties. Specifically, we show that as the tree depth grows, the number of local maxima decreases, and therefore in the limiting case PG converges to a globally optimal solution.
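To make the idea concrete, here is a minimal sketch of the general mechanism, not the paper's exact algorithm: a softmax policy whose action preferences are shifted by an h-step tree-search lookahead, trained with REINFORCE on a toy chain MDP. The MDP, the depth parameter, the lookahead-shifted softmax, and all function names are our illustrative assumptions; the intuition is that deeper lookahead flattens the deceptive local maximum created by a small immediate reward.

```python
import numpy as np

# Toy deterministic chain MDP with a deceptive local maximum:
# moving left always pays a small reward (0.2); moving right pays
# nothing until the agent reaches the right end, which pays 1.0.
N_STATES, GAMMA = 6, 0.95

def step(s, a):  # a in {0: left, 1: right}
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    r = 0.2 if a == 0 else (1.0 if s2 == N_STATES - 1 else 0.0)
    return s2, r

def lookahead(s, a, depth):
    """h-step tree-search value: best discounted return over depth-limited paths."""
    s2, r = step(s, a)
    if depth == 1:
        return r
    return r + GAMMA * max(lookahead(s2, b, depth - 1) for b in (0, 1))

def policy(theta, s, depth):
    """Softmax over logits shifted by the tree-search lookahead values.
    The lookahead term does not depend on theta, so the usual
    log-softmax gradient (one-hot minus probabilities) still applies."""
    prefs = np.array([theta[s, a] + lookahead(s, a, depth) for a in (0, 1)])
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce(depth, iters=300, lr=0.5, horizon=12, seed=0):
    rng, theta = np.random.default_rng(seed), np.zeros((N_STATES, 2))
    for _ in range(iters):
        s, traj, G = 0, [], 0.0
        for t in range(horizon):
            p = policy(theta, s, depth)
            a = rng.choice(2, p=p)
            s2, r = step(s, a)
            traj.append((s, a, p))
            G += (GAMMA ** t) * r
            s = s2
        for (s_, a_, p_) in traj:  # vanilla REINFORCE update on the logits
            grad = -p_
            grad[a_] += 1.0
            theta[s_] += lr * G * grad
    return G  # return of the final episode, as a rough quality indicator

print(reinforce(depth=1), reinforce(depth=4))
```

On this toy chain, comparing depth 1 against depth 4 typically shows the deeper lookahead escaping the myopic left-reward trap, mirroring the paper's claim that local maxima vanish as the tree grows.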
Submission Number: 23