Scalable Safe Policy Improvement via Monte Carlo Tree Search

Alberto Castellini; Federico Bianchi; Edoardo Zorzi; Thiago D. Simão; Alessandro Farinelli; Matthijs T. J. Spaan

Scalable Safe Policy Improvement via Monte Carlo Tree Search

Alberto Castellini, Federico Bianchi, Edoardo Zorzi, Thiago D. Simão, Alessandro Farinelli, Matthijs T. J. Spaan

Published: 24 Apr 2023, Last Modified: 15 Jun 2023ICML 2023 PosterEveryoneRevisions

Abstract: Algorithms for safely improving policies are important to deploy reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search based strategy. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.

Submission Number: 2738

Loading