Distributional Monte-Carlo Planning with Thompson Sampling in Stochastic Environments

Published: 17 Jun 2024 · Last Modified: 26 Jul 2024 · FoRLaC Poster · CC BY 4.0
Abstract: We focus on Monte-Carlo Tree Search (MCTS), a class of reinforcement learning algorithms, in stochastic settings. While recent advancements combining MCTS with deep learning have excelled in deterministic environments, they struggle in highly stochastic settings, leading to suboptimal action choices and degraded performance. Distributional Reinforcement Learning (RL) addresses these challenges by extending the traditional Bellman equation to value distributions instead of a single mean value, and has shown promising results in deep Q-learning. In this paper, we bring the concept of distributional RL to MCTS, modeling value functions as categorical and particle distributions. We propose two novel algorithms: Categorical Thompson Sampling for MCTS (CATS), which represents Q-values with categorical distributions, and Particle Thompson Sampling for MCTS (PATS), which models Q-values with particle-based distributions. Both algorithms employ Thompson Sampling to handle randomness in action selection. Our contributions are threefold: we introduce a distributional framework for Monte-Carlo planning that models uncertainty in return estimation; we prove the effectiveness of our algorithms by establishing a non-asymptotic, problem-dependent upper bound on simple regret of order $O(n^{-1})$, where $n$ is the number of trajectories; and we provide empirical evidence demonstrating the efficacy of our approach compared to baselines in both stochastic and deterministic environments.
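To make the abstract's core idea concrete, below is a minimal, illustrative sketch of Thompson-sampling action selection over categorical Q-value estimates at a single tree node. It is not the paper's algorithm; the support atoms, function names (`select_action`, `update_categorical`), and the simplified backup rule are all hypothetical stand-ins for the actual CATS procedure.

```python
import numpy as np

# Illustrative sketch only: Thompson-sampling action selection over
# categorical Q-value estimates at one MCTS node. All names and the
# simplified update rule are hypothetical, not taken from the paper.

ATOMS = np.linspace(-1.0, 1.0, 51)  # fixed support of the categorical distribution

def select_action(probs_per_action, rng):
    """probs_per_action: (num_actions, num_atoms) row-stochastic matrix.
    Sample one plausible return per action, then act greedily on the samples."""
    sampled_values = np.array([rng.choice(ATOMS, p=p) for p in probs_per_action])
    return int(np.argmax(sampled_values))

def update_categorical(probs, observed_return, lr=0.1):
    """Move probability mass toward the atom nearest the observed return
    (a crude stand-in for a proper distributional Bellman backup)."""
    target = np.zeros_like(probs)
    target[np.abs(ATOMS - observed_return).argmin()] = 1.0
    return (1 - lr) * probs + lr * target

rng = np.random.default_rng(0)
probs = np.full((3, ATOMS.size), 1.0 / ATOMS.size)  # uniform prior over returns for 3 actions
a = select_action(probs, rng)
probs[a] = update_categorical(probs[a], observed_return=0.3)
```

A particle-based variant (in the spirit of PATS) would replace the fixed atom grid with a set of weighted sample returns per action, with Thompson sampling again drawing one value per action before taking the argmax.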
Format: Long format (up to 8 pages + refs, appendix)
Publication Status: No
Submission Number: 80