TPO: TREE SEARCH POLICY OPTIMIZATION FOR CONTINUOUS ACTION SPACES

Amir Yazdanbakhsh; Ebrahim Songhori; Robert Ormandi; Anna Goldie; Azalia Mirhoseini

25 Sept 2019 (modified: 05 May 2023), ICLR 2020 Conference Blind Submission, Readers: Everyone

Keywords: monte-carlo tree search, reinforcement learning, tree search, policy optimization

TL;DR: We use MCTS to further optimize a bootstrapped policy for continuous action spaces under a policy iteration setting.

Abstract: Monte Carlo Tree Search (MCTS) has achieved impressive results on a range of discrete environments, such as Go, Mario and Arcade games, but it has not yet fulfilled its true potential in continuous domains. In this work, we introduce TPO, a tree search based policy optimization method for continuous environments. TPO takes a hybrid approach to policy optimization. Building the MCTS tree in a continuous action space and updating the policy gradient using off-policy MCTS trajectories are non-trivial. To overcome these challenges, we propose limiting the tree search branching factor by drawing only a few action samples from the policy distribution, and defining a new loss function based on the trajectories' mean and standard deviations. Our approach led to some non-intuitive findings. MCTS training generally requires a large number of samples and simulations. However, we observed that bootstrapping tree search with a pre-trained policy allows us to achieve high-quality results with a low MCTS branching factor and a small number of simulations. Without the proposed policy bootstrapping, continuous MCTS would require a much larger branching factor and simulation count, rendering it computationally prohibitive. In our experiments, we use PPO as our baseline policy optimization algorithm. TPO significantly improves the policy on nearly all of our benchmarks. For example, in complex environments such as Humanoid, we achieve a 2.5× improvement over the baseline algorithm.
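The branching-factor idea in the abstract can be illustrated with a minimal sketch. It assumes a diagonal Gaussian policy and expands a tree node by drawing a small, fixed number of actions from it. The names `Node`, `sample_actions`, `expand`, and the toy `policy` callable are hypothetical and are not taken from the paper; the proposed loss on trajectory means and standard deviations is not shown, since the abstract does not give its form.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a state plus the actions sampled at it."""
    state: tuple
    actions: list = field(default_factory=list)


def sample_actions(mean, std, k, rng):
    """Draw k actions from a diagonal Gaussian N(mean, std^2) (assumed policy form)."""
    return [tuple(rng.gauss(m, s) for m, s in zip(mean, std)) for _ in range(k)]


def expand(node, policy, k, rng):
    """Expand a node with only k sampled actions, so the branching factor is k."""
    mean, std = policy(node.state)
    node.actions.extend(sample_actions(mean, std, k, rng))
    return node


# Toy stand-in for a pre-trained policy: zero mean, unit std in two action dimensions.
def policy(state):
    return (0.0, 0.0), (1.0, 1.0)


rng = random.Random(0)
root = expand(Node(state=(0.0, 0.0)), policy, k=4, rng=rng)
```

Here `k` plays the role of the branching factor: a continuous action space has infinitely many candidate actions, so the tree only ever considers the `k` samples drawn at each node. A policy that already concentrates on good actions makes a small `k` more likely to suffice, which is the intuition behind bootstrapping from a pre-trained policy.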

Data: [OpenAI Gym](https://paperswithcode.com/dataset/openai-gym)
