TL;DR: We develop a planning algorithm with provable convergence guarantees for continuous decision-making in stochastic environments, outperforming existing methods on robotic control tasks.
Abstract: Monte-Carlo Tree Search (MCTS) has demonstrated success in online planning for deterministic environments, yet significant challenges remain in adapting it to stochastic Markov Decision Processes (MDPs), particularly over continuous state-action spaces. Existing methods such as HOOT, which combines MCTS with the Hierarchical Optimistic Optimization (HOO) bandit strategy, handle continuous spaces but rely on a logarithmic exploration bonus that lacks theoretical guarantees in non-stationary, stochastic settings. A recent advance, Poly-HOOT, introduced a polynomial bonus term to achieve convergence in deterministic MDPs, but an analogous theory for stochastic MDPs remains undeveloped. In this paper, we propose a novel MCTS algorithm, Stochastic-Power-HOOT, designed for continuous, stochastic MDPs. Stochastic-Power-HOOT uses a power mean as its value backup operator, together with a polynomial exploration bonus that addresses the non-stationarity inherent in continuous action spaces. Our theoretical analysis establishes that Stochastic-Power-HOOT converges at a polynomial rate of $\mathcal{O}(n^{-1/2})$, where $n$ is the number of visited trajectories, thereby extending the non-asymptotic convergence guarantees of Poly-HOOT to stochastic environments. Experimental results on synthetic, stochastic tasks validate our theoretical findings and demonstrate the effectiveness of Stochastic-Power-HOOT in continuous, stochastic domains.
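To make the two ingredients named in the abstract concrete, here is a minimal Python sketch (not the authors' released code) of a power-mean value backup and child selection with a polynomial exploration bonus. The node layout, the exponents `a` and `b`, and the constant `c` are illustrative assumptions; the paper derives its own constants from the convergence analysis, and value estimates are assumed nonnegative so the power mean is well defined.

```python
import numpy as np

def power_mean_backup(child_values, child_counts, p=2.0):
    """Visit-weighted power mean over child value estimates.

    Assumes nonnegative value estimates (e.g., returns scaled to [0, 1]).
    p = 1 recovers the ordinary average used by UCT; larger p moves the
    backup toward the max, giving the tunable optimism the paper exploits.
    """
    w = child_counts / child_counts.sum()
    return float((w * child_values ** p).sum() ** (1.0 / p))

def polynomial_bonus(t, n, c=1.0, a=0.25, b=0.5):
    """A polynomial exploration bonus c * t^a / n^b, in place of UCT's
    sqrt(log t / n). The exponents here are placeholders, not the
    paper's derived values."""
    return c * t ** a / n ** b

def select_child(child_values, child_counts, c=1.0, a=0.25, b=0.5):
    """Pick the child maximizing value estimate plus polynomial bonus."""
    t = child_counts.sum()
    scores = child_values + polynomial_bonus(t, child_counts, c, a, b)
    return int(np.argmax(scores))

# Toy usage: three children with nonnegative mean returns in [0, 1].
vals = np.array([0.4, 0.6, 0.5])
cnts = np.array([10.0, 3.0, 5.0])
print(power_mean_backup(vals, cnts, p=2.0))  # backed-up node value
print(select_child(vals, cnts))              # index of child to visit next
```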
Lay Summary: Imagine you're playing a complex video game where you need to make decisions continuously (like steering a car smoothly rather than just turning left or right) and the game world is unpredictable (sometimes the same action leads to different outcomes). Traditional artificial intelligence planning methods struggle in such scenarios because they were designed for simpler, more predictable environments.
Our research tackles this challenge by developing a new AI planning algorithm called Stochastic-Power-HOOT. Think of it as a smarter way for computers to "think ahead" when making decisions in complex, uncertain environments. The key innovation pairs a mathematical technique called the "power mean", a sophisticated averaging method that can be tuned to be more optimistic or more conservative depending on the situation, with a systematic way to explore the vast space of possible actions.
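As a toy illustration of this tunable averaging (our own example, not drawn from the paper): the power mean with exponent p slides between a pessimistic summary (close to the worst outcome), the plain average, and an optimistic summary (close to the best outcome) of a set of positive values.

```python
import numpy as np

# Power mean of a set of positive outcomes for several exponents p.
# p = 1 is the familiar average; large positive p leans toward the best
# outcome (optimistic); large negative p leans toward the worst
# (conservative). Values must be positive for the formula to apply.
outcomes = np.array([1.0, 2.0, 4.0])

def power_mean(x, p):
    return np.mean(x ** p) ** (1.0 / p)

for p in (-4.0, 1.0, 4.0):
    print(f"p = {p:+.0f}: {power_mean(outcomes, p):.3f}")
# p = -4 -> near the min (conservative), p = +1 -> 2.333 (plain average),
# p = +4 -> near the max (optimistic)
```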
The traditional approach is like trying to navigate a city by only considering a few predetermined routes. Our method is more like having a GPS that can dynamically explore the entire road network while learning which paths are most promising. We prove mathematically that our algorithm will eventually find near-optimal solutions, and importantly, we show it works even when the environment is unpredictable.
We tested our approach on robotic control tasks, from simple balancing problems to complex humanoid robots with many moving parts. The results show that our method consistently outperforms existing approaches, especially in noisy, uncertain environments. This advance could lead to better autonomous vehicles, more capable robots, and AI systems that can handle real-world complexity more effectively.
The broader impact is significant: as AI systems increasingly operate in unpredictable real-world environments, from self-driving cars navigating busy streets to robots working alongside humans, reliable planning algorithms that can handle uncertainty are crucial for both performance and safety.
Primary Area: Reinforcement Learning->Planning
Keywords: Monte-Carlo Tree Search; Continuous Reinforcement Learning Planning
Flagged For Ethics Review: true
Submission Number: 15370