\documentclass[accepted]{uai2024} % for initial submission
%\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{hyperref}

\usepackage{times}  % DO NOT CHANGE THIS
\usepackage{helvet}  % DO NOT CHANGE THIS
\usepackage{courier}  % DO NOT CHANGE THIS
% \usepackage[hyphens]{url}  % DO NOT CHANGE THIS
\usepackage{graphicx} % DO NOT CHANGE THIS
\urlstyle{rm} % DO NOT CHANGE THIS
\def\UrlFont{\rm}  % DO NOT CHANGE THIS
\usepackage{natbib}  % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT
\usepackage{caption} % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT

\usepackage{amssymb}
\usepackage{amsthm} 
\usepackage{lipsum}

\usepackage{mdframed}

\usepackage{tikz}
\usepackage{ctable}
\usepackage{multirow}

\usepackage{mathtools}

\DeclareCaptionStyle{ruled}{labelfont=normalfont,labelsep=colon,strut=off} % DO NOT CHANGE THIS
\frenchspacing  % DO NOT CHANGE THIS
\setlength{\pdfpagewidth}{8.5in}  % DO NOT CHANGE THIS
\setlength{\pdfpageheight}{11in}  % DO NOT 

\input{math_commands.tex}
% \usepackage{hyperref}
\usepackage{url}
% \usepackage{algorithm}
% \usepackage{algpseudocode}
\usepackage{subfig}
\usepackage{comment}
\usepackage{multirow}
% \usepackage[]{apacite}
% \usepackage{cite}
% \bibliographystyle{alpha}
\usepackage{booktabs}
\usepackage{minted}

\usepackage{graphicx}       
\usepackage{caption}                    
\usepackage{float}      
\usepackage{subfig} 

\def\Pr{\mathbf{Pr}}
\renewcommand\labelenumi{(\roman{enumi})}
\renewcommand\theenumi\labelenumi

\newcommand\omicron{o}

%%% Tuan Added
\usepackage{amsmath,amsfonts,bm,nicefrac,amssymb}
\usepackage{amsthm}

\SetAlFnt{\small}
\SetAlCapFnt{\small}
\SetAlCapNameFnt{\small}
\usepackage{braket}
% \usepackage[noend]{algpseudocode}
\usepackage{algorithmic}
\usepackage{amssymb}
\usepackage{scalerel}
\usepackage{multicol}

\newmdtheoremenv{theorembox}{Theorem}
\newmdtheoremenv{lemmabox}{Lemma}
\newmdtheoremenv{definitionbox}{Definition}

\usepackage{empheq}
\usepackage{xcolor}
\definecolor{lightgreen}{HTML}{90EE90}
\newcommand{\boxedeq}[2]{\begin{empheq}[box={\fboxsep=6pt\fbox}]{align}\label{#1}#2\end{empheq}}
\newcommand{\coloredeq}[2]{\begin{empheq}[box=\colorbox{lightgreen}]{align}\label{#1}#2\end{empheq}}


\usepackage{todonotes}
\usepackage{blindtext}

%%%% Emilie
\usepackage{dsfont}
\newcommand{\todoe}[1]{\todo[inline,color=green!40]{EK: #1}}
\newcommand{\todot}[1]{\todo[inline,color=blue!40]{TD: #1}}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\bE}{\mathbb{E}}
\newcommand{\bP}{\mathbb{P}}
\newcommand{\ind}{\mathds{1}}
\newcommand{\cv}[1]{\overset{#1}{\underset{n \rightarrow \infty}{\longrightarrow}}}
\newtheorem{remark}{Remark}
\newtheorem{corollary}{Corollary}

\usepackage{hyperref}
\usepackage{xcolor}
\definecolor{Bleu}{RGB}{37,144,230}
\definecolor{Red}{HTML}{CD5C5C}
\hypersetup{colorlinks,citecolor=Bleu,linkcolor=Red}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Power Mean Estimation in Stochastic Monte-Carlo Tree Search}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<tuanquangdam@gmail.com>?Subject="Power Mean Estimation in Stochastic Monte-Carlo Tree Search"}{Tuan Dam}{}}
\author[1]{Odalric-Ambrym Maillard}
\author[1]{Emilie Kaufmann}

% Add affiliations after the authors
\affil[1]{%
    Univ.  Lille, Inria, CNRS, Centrale Lille, UMR 9198-CRIStAL, F-59000 Lille, France
}

\begin{document}
\maketitle

\begin{abstract}
Monte-Carlo Tree Search (MCTS) is a widely used strategy for online planning that combines Monte-Carlo sampling with forward tree search. Its success relies on the Upper Confidence bound for Trees (UCT) algorithm, an extension of the UCB method for multi-armed bandits. However, the theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection, leading to the development of Fixed-Depth-MCTS with a polynomial exploration bonus to balance exploration and exploitation~\citep{shah2022journal}.
Both UCT and Fixed-Depth-MCTS suffer from biased value estimation: the weighted sum underestimates the optimal value, while the maximum valuation overestimates it~\citep{coulom2006efficient}. The power mean estimator offers a balanced solution, lying between the average and maximum values. Power-UCT~\citep{dam2019generalized} incorporates this estimator for more accurate value estimates but its theoretical analysis remains incomplete.
This paper introduces Stochastic-Power-UCT, an MCTS algorithm that uses the power mean estimator and is tailored to stochastic MDPs. We analyze its polynomial convergence in estimating root node values and show that it shares the same convergence rate of $\mathcal{O}(n^{-1/2})$ as Fixed-Depth-MCTS, where $n$ is the number of visited trajectories, with the latter being a special case of the former. Our theoretical results are validated with empirical tests across various stochastic MDP environments.

% Monte-Carlo Tree Search (MCTS) is a popular online planning strategy that combines Monte-Carlo sampling with forward tree search for optimal decision-making. Its success hinges on the application of the Upper Confidence bound for Trees (UCT) algorithm, an extension of the Upper Confidence bound (UCB) method used in multi-arm bandits. However, the theoretical foundation of UCT has been incomplete due to an error in the logarithmic bonus term for action selection. This led to the introduction of a polynomial exploration bonus in Fixed-Depth-MCTS to better balance exploration and exploitation~\citep{shah2020non}.
% A key issue with UCT and Fixed-Depth-MCTS is their biased value estimation: a weighted sum underestimates the optimal value, while the maximum valuation overestimates the best mean value. The power mean estimator addresses this by providing a balanced value estimation, lying between the average and the maximum. Power-UCT~\citep{dam2019generalized}, which incorporates the power mean estimator, offers more accurate value estimates but its theoretical proof remains incomplete.
% This paper introduces Stochastic-Power-UCT, an MCTS algorithm for stochastic MDPs using the power mean estimator. We study its polynomial convergence in estimating root node values. Our findings show that Stochastic-Power-UCT achieves the same convergence rate of $\mathcal{O}(n^{-1/2})$ as Fixed-Depth-MCTS, with the latter being a special case of the former. We validate our theoretical results with empirical tests across various stochastic MDP environments, supporting our claims with empirical evidence.
\end{abstract}

% \begin{abstract}
% Monte-Carlo Tree Search (MCTS) is a widely-used online planning strategy that effectively combines Monte-Carlo sampling with forward tree search to make optimal decisions in various real-world scenarios. Its success hinges on the application of the Upper Confidence bound for Trees (UCT) algorithm, an extension of the Upper Confidence bound (UCB) method used in multi-arm bandits. However, theoretical investigations of UCT have been found incomplete due to an error in the estimated "logarithmic" bonus term used for action selection in the tree. This issue was addressed by introducing a "polynomial" exploration bonus, which effectively balances the exploration-exploitation trade-off called Fixed-Depth-MCTS. However, it is worth mentioning a fundamental problem with UCT and therefore Fixed-Depth-MCTS: Its weighted sum approach underestimates the optimal value in the tree while backing up the maximum valuation overestimates the best mean value[1]. To address this imbalance, the power mean estimator offers a solution by providing a balanced approach to value estimation in the tree, as the value return by power mean lies between average and max[2]. By incorporating the power mean estimator, Power-UCT[2] achieves a more nuanced and accurate estimation of values and mitigates the biases inherent in traditional UCT methods. However, the proof of Power-UCT follows the same steps as UCT, therefore remains incomplete.
% In this paper, we introduce Stochastic-Power-UCT, an MCTS algorithm explicitly designed for stochastic MDPs using power mean as the value estimator. We conduct a comprehensive study of the polynomial convergence assurance for selecting the optimal action and estimating the value at the root node within our methodology. Our findings demonstrate that Stochastic-Power-UCT shares the same convergence rate of $\mathcal{O}(n^{-1/2})$ to the optimum as Fixed-Depth-MCTS, with Fixed-Depth-MCTS being a special case of Stochastic-Power-UCT. Furthermore, we validate our theoretical results through empirical assessments across diverse stochastic MDP environments, providing empirical evidence to support our method's theoretical claims
% \end{abstract}
\section{Introduction}
\label{introduction}
Monte-Carlo Tree Search (MCTS) is a family of dynamic planning algorithms that integrates asymmetric tree search and reinforcement learning (RL) to solve decision problems. Recent advances in coupling MCTS with deep learning techniques for value estimation have facilitated the solution of complex problems with high branching factors that were considered impossible just a few years ago \citep{silver2016mastering,silver2017mastering,schrittwieser2020mastering}. The success of MCTS rests on adaptive exploration of the tree using, e.g., strategies inspired by the multi-armed bandit literature. One of the most well-known algorithms is Upper Confidence bound for Trees (UCT)~\citep{kocsis2006improved}, which turns the UCB1 algorithm~\citep{auer2002finite} into a strategy for selecting actions during tree expansion.

\citet{kocsis2006improved} offer a theoretical analysis of UCT in deterministic environments, establishing convergence to the optimal action for a given state after a sufficient number of simulations. However, as pointed out recently by \citet{shah2020non}, there are some issues in the proof of this assertion. The problem comes from the use of a ``logarithmic'' bonus term within UCT, designed to balance exploration and exploitation during tree-based search. This approach is built upon the assumption that the regret of the underlying recursively dependent nonstationary MABs concentrates exponentially fast around its expected value as the number of steps grows. However, as demonstrated by~\citet{audibert2009exploration}, the validity of this assumption is doubtful, given that the underlying regret concentrates polynomially rather than exponentially.

Building upon these insights, \citet{shah2022journal} propose an adapted version of UCT that replaces the logarithmic bonus term with a polynomial one. They offer a comprehensive theoretical analysis showing that the resulting algorithm, called Fixed-Depth-MCTS, ensures polynomial convergence in value function estimation at the root node. 
However, their work is mostly focused on deterministic environments.\footnote{The first version of their work \citep{shah2020non} mentioned the stochastic case as an open question, while it is treated in Appendix A of the journal version~\citep{shah2022journal}. However, they use a high-level reduction argument, and the stochastic version of their algorithm is not explicitly presented.} Moreover, the Power-UCT algorithm, introduced by~\citet{dam2019generalized} as a generalization of UCT, similarly focuses on deterministic environments, and its current analysis suffers from the same shortcoming as that of UCT \citep{kocsis2006improved}. 

This paper introduces Stochastic-Power-UCT, a novel MCTS algorithm for stochastic MDPs that uses the power mean as the value estimator. We adopt the same form of polynomial bonus term as introduced by~\citet{shah2020non}. We show that Stochastic-Power-UCT also ensures polynomial concentration of the value estimate at the root node. We complement our analysis with a variety of experiments on stochastic tasks that confirm our theory.
Thus, our \textit{contribution} is threefold:
\begin{enumerate}
    \item We propose Stochastic-Power-UCT with a complete theoretical convergence guarantee using the power mean backup operator in stochastic MCTS.
    \item We demonstrate that the estimated value function at the root node of our tree converges polynomially to the optimal value, exhibiting the same convergence rate as Fixed-Depth-MCTS \citep{shah2020non}, namely $\mathcal{O}(n^{-1/2})$, where $n$ is the number of visited trajectories. Our method employs the power mean as value estimator, with the arithmetic mean used in Fixed-Depth-MCTS being a special case.
    \item We conduct experiments on the SyntheticTree toy task and on various stochastic MDPs, which support our theoretical analysis.
\end{enumerate}
% \section{Related Work}
% \label{s:relatedwork}

% \todoe{This section is not useful. remove?}


% The UCT (Upper Confidence bounds for Trees)~\cite{kocsis2006improved} algorithm, an extension of the well-known UCB1~\cite{auer2002finite} multi-armed bandit algorithm, based on the Optimism under uncertainty principle designed for deterministic MDPs. \citet{kocsis2006improved} demonstrated that UCT \textit{asymptotically} converges to the optimal policy. However, their proof was found to be incomplete because it incorrectly assumed that each node in the tree is independent and identically distributed (i.i.d.), whereas each node's state actually depends on the previous action state~\citep{kocsis2006bandit,kocsis2006improved}. \citet{shah2020non} provided a corrected proof for UCT in deterministic MDPs, based on a further investigation of non-stationary multi-armed bandits (MABs). They established a polynomial concentration property of regret for this class of MABs. This corrected proof shows that MCTS, with an appropriate polynomial rather than logarithmic bonus term in UCB, maintains the claimed properties of UCT.
% In turn, \citet{shah2020non} show that the value estimation at the root node can be \textit{non-asymptotically} polynomially decayed at the rate $O(n^{-1/2})$, where $n$ is the number of visited trajectories. A short paragraph presented to show that the fixed proof can be applied to stochastic settings, however, a full proof is not provided. Stochastic-Power-UCT in our paper has been proposed with full proof in stochastic MDPs. It follows similar steps as the work of \citet{shah2020non}, however, we apply power mean as the value estimator and we show that our method share the same rate for convergence as  $\mathcal{O}(n^{-1/2})$.

\section{Setting}\label{S:setting}
In this section, we first provide some background knowledge about Markov Decision Processes, and then we give an overview of Monte-Carlo Tree Search. 

\paragraph{Markov Decision Process}
In Reinforcement Learning (RL), the agent aims to make optimal decisions in an environment modeled as a Markov Decision Process (MDP), a standard framework for sequential decision-making. We focus on a discrete-time discounted MDP, defined as $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$, where 
$\mathcal{S}$ is the state space,
$\mathcal{A}$ is the finite action space,
$\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function,
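$\mathcal{P}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition dynamics, where $\Delta(\mathcal{S})$ denotes the set of probability distributions over $\mathcal{S}$ and $\mathcal{P}(s'|s,a)$ is the probability of reaching $s'$ after taking action $a$ in state $s$, and
$\gamma \in [0, 1)$ is the discount factor.
A policy $\pi \in \Pi$ maps each state $s$ to a probability distribution $\pi(\cdot|s)$ over feasible actions.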

We denote by $Q^\pi(s,a)$ the $Q$-value function under the policy $\pi$, defined through the Bellman equation as
\begin{equation}\resizebox{1.\hsize}{!}{$
    Q^\pi(s,a) \triangleq \sum_{s'} \mathcal{P}(s'|s,a)\left[\mathcal{R}(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right],\nonumber
$}
\end{equation}
and the value function under the policy $\pi$ is denoted as $V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^{\pi}(s,a)$. 
Our goal is to find the optimal policy that maximizes the value function at each state. The optimal value function at state $s$ is defined as $V^{\star}(s) = \max_{a \in \mathcal{A}} Q^{\star}(s,a)$, where the optimal $Q$-value function $Q^{\star}(s,a)$ at state $s$ and action $a$ satisfies the optimal Bellman equation~\citep{bellman1954theory} 
\begin{equation}\resizebox{1.\hsize}{!}{$
Q^\star(s,a) \triangleq \sum_{s'} \mathcal{P}(s'|s,a)\left[ \mathcal{R}(s,a,s') + \gamma \max_{a'}Q^\star(s',a') \right].\nonumber 
$}
\end{equation}
\paragraph{Monte-Carlo Tree Search}
Monte-Carlo Tree Search (MCTS) (see~\citet{browne2012survey} for a survey) is a family of online planning strategies that combines Monte-Carlo sampling with forward tree search to find on-the-fly optimal decisions. MCTS algorithms use a black-box model of the environment in simulation to build a planning tree. An MCTS algorithm consists of four components: selecting nodes to traverse in the tree based on the current statistical information, expanding the tree, evaluating the leaf that has been reached (possibly using a roll-out in the environment), and using the rewards collected along the chosen path to update the value estimates stored in the tree. The key elements influencing the quality of a particular algorithm are an effective value update operator and an efficient node selection strategy in the tree.

\paragraph{Formalization} An MCTS algorithm adaptively collects trajectories in an MDP, starting from an initial state $s_0$, to build a planning tree. Each trajectory continues until it either reaches a leaf node or a node at some maximum depth $H$. At the end of each trajectory, a playout policy (which can be either deterministic or stochastic) is applied from the last node reached, to provide an evaluation of the corresponding state. After $t$ trajectories, it may output two things: 
\begin{itemize}
 \item $\widehat{a}_t$, a guess for the best action to take in state $s_0$
 \item $\widehat{V}_t(s_0)$ an estimator of the optimal value in $s_0$
\end{itemize}
Its performance can be evaluated by its convergence rate $r(t)$, defined through a bound of the form
\begin{flalign}
\bE\left[V^\star(s_0) - Q^\star\left(s_0,\widehat{a}_t\right)\right] \leq r(t) \\  \text{ or } \left|\bE\left[V^\star(s_0) - \widehat{V}_t\left(s_0\right)\right] \right| \leq r(t).
\end{flalign} 
We shall analyze an MCTS algorithm using some maximal planning horizon $H$ and a playout policy $\pi_{0}$ with value $V_0$. Denoting by $s_h$ a node at depth $h$ in the tree (identified with some state that may be reached in $h$ steps from the root node), we can define inductively $ \widetilde{V}(s_{H})=V_{0}(s_{H})$ and, for all $h\leq H-1$,
\begin{eqnarray*}
\widetilde{Q}(s_h,a) & = & r(s_h,a) + \gamma \!\!\!\!\sum_{s_{h+1} \in \mathcal{S}}\mathcal{P}(s_{h+1}|s_h,a)\widetilde{V}(s_{h+1}), \\ 
\widetilde{V}(s_h) & = & \max_{a} \widetilde{Q}(s_h,a),  
\end{eqnarray*}
where $r(s_h,a)$ is the mean intermediate reward obtained at state $s_h$ after taking action $a$. Then we have $|Q^\star(s_0,a) - \widetilde{Q}(s_0,a)| \leq \gamma^{H} \|V^\star - V_0\|_{\infty}$ (actually the supremum could be restricted to all states reachable in $H$ steps from $s_0$). The purpose of an MCTS algorithm is to make the convergence rate $r(t)$ as small as possible by building estimates of $\widetilde{Q}(s_0,a)$ and $\widetilde{V}(s_0)$, from which it can finally estimate $Q^\star(s_0,a)$ and the best action at the root node, $a_\star = \argmax_{a} Q^\star(s_0,a)$.  
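
As an illustration of this truncation bound (the numbers are ours and purely illustrative; they are not taken from the experiments), take $\gamma = 1/2$ and $H = 10$. Then
\[
|Q^\star(s_0,a) - \widetilde{Q}(s_0,a)| \leq 2^{-10}\, \|V^\star - V_0\|_{\infty} = \frac{\|V^\star - V_0\|_{\infty}}{1024},
\]
so the error due to the finite horizon is below $0.1\%$ of the worst-case playout error. More generally, assuming $\|V^\star - V_0\|_{\infty} > 0$, a truncation error of at most $\varepsilon > 0$ is guaranteed as soon as
\[
H \geq \frac{\log\left(\|V^\star - V_0\|_{\infty} / \varepsilon\right)}{\log(1/\gamma)}.
\]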

We are now ready to present our particular MCTS instance, the Stochastic-Power-UCT algorithm.

\section{Stochastic Power-UCT}


\SetKwProg{Fn}{}{}{}
\begin{algorithm*}[ht!]
\small
  \caption{Stochastic-Power-UCT. $\gamma$ is the discount factor, $n$ is the number of rollouts, and $\{b_{i}\}^H_{i=0}$, $\{\alpha_{i}\}^H_{i=0}$, $\{\beta_{i}\}^H_{i=0}$ are positive algorithmic constants that satisfy the conditions in Table~\ref{algorithmic_constants}. $\pi_0$ is a rollout policy and $C$ is an exploration constant.}
  \textbf{Input: } 
  \text{root node state $s_{0}$}\\
  \textbf{Output:}
  \text{optimal action at the root node}
  \label{stochastic_p_uct}
  \begin{multicols}{2}
  \SetKwFunction{SelectAction}{SelectAction}
  \SetKwFunction{Search}{Search}
  \SetKwFunction{Expand}{Expand}
  \SetKwFunction{Rollout}{Rollout}
  \SetKwFunction{SimulateV}{SimulateV}
  \SetKwFunction{SimulateQ}{SimulateQ}
  \SetKwFunction{MainLoop}{MainLoop}
  \BlankLine
  \Fn{R = \Rollout{$s, depth$}}{
    $\widetilde{V}(s) = \text{average of the call to } \pi_{0}(s)$\\
    \Return $\widetilde{V}(s)$
  }
  \BlankLine
  \Fn{a = \SelectAction{$s_{h}$, $depth=h$, greedy=false, $t$}}{
  \uIf {greedy == false} {
        $a = \underset{a}{\argmax} \Set{ \widehat{Q}_{T_{s_{h},a}(t)}(s_{h}, a) + C\frac{T_{s_{h}}(t)^{\frac{b_{h+1}}{\beta_{h+1}}}}{T_{s_{h},a}(t)^{\frac{\alpha_{h+1}}{\beta_{h+1}}}} }$\\
    } \Else{
        $a = \underset{a}{\argmax} \Set{ \widehat{Q}_{T_{s_{h},a}(t)}(s_{h}, a) }$\\
    }
    \Return $a$
  }
  \Fn{\SimulateV{$s_{h}, depth, t$}} {
    {$a \gets ${\SelectAction}$(s_{h}, depth=$h$, greedy=\text{false},t)$}\\
    {{\SimulateQ} $(s_{h},a, depth=$h$, t)$}\\
    $T_{s_{h}}(t) \gets T_{s_{h}}(t) + 1$ \\
    $\widehat{V}_{T_{s_h}(t)}(s_{h}) \gets \left( \underset{a}{\sum} \frac{T_{s_{h},a}(t)}{T_{s_{h}}(t)} (\widehat{Q}_{T_{s_{h},a}(t)})^{p}(s_{h},a) \right)^{\frac{1}{p}}$
  }
  \BlankLine
  \Fn{\SimulateQ{$s_{h}$, $a$, $depth=h$, t}}{
    {$s_{h+1}$}{$ \sim \mathcal{P}(\cdot|s_{h},a)$}\\
    {$r(s_{h},a)$}{$ \sim \mathcal{R}(s_{h},a,s_{h+1})$}\\
    \If {$ s_{h+1} \notin {Terminal} $} {
        \uIf{Node $s_{h+1} \text{ not expanded}$} {
        $\widehat{V}_{T_{s_{h+1}}(t)}(s_{h+1}) = \Rollout(s_{h+1}, depth$)\\
        } \Else {
            \SimulateV($s_{h+1}, depth=h+1$, t)\\
        }
    }
    {$\widehat{Q}_{T_{s_{h},a}(t)}(s_{h},a)$\\
    $\gets$ $\frac{\widehat{Q}_{T_{s_{h},a}(t)}(s_{h},a)T_{s_{h},a}(t) + r(s_{h},a) + \gamma \widehat{V}_{T_{s_{h+1}}(t)}(s_{h+1}) }{T_{s_{h},a}(t)+1} $}\\
    {$T_{s_{h},a}(t) \gets T_{s_{h},a}(t) + 1$}
  }
  \BlankLine
  \Fn{\MainLoop}{
    $t=0$\\
    \While{\text{$ t < n$}}{
      \SimulateV($s_0, depth=0,t$)\\
      $t \gets t +1$
    }
    \Return \SelectAction{$s_0$, $depth=0$, greedy=true, $n$}
  }
  \BlankLine
  \end{multicols}
\end{algorithm*}

In this section, we first present a generic UCT-like algorithm and then our Stochastic-Power-UCT algorithm.

\subsection{Generic UCT-like algorithm} 

For each node $s_h$ at depth $h$ of the search tree, and for each available action $a \in \cA_{s_h}$, we denote by 
\begin{itemize}
 \item $\widehat{V}_{t}(s_h)$ the value estimate built after $s_h$ has been visited $t$ times at depth $h$ 
 \item $\widehat{Q}_{t}(s_h,a)$ the Q-value estimate built after $(s_h,a)$ has been visited $t$ times at depth $h$
\end{itemize}
We denote by $T_{s_h}(t)$, $T_{s_h,a}(t)$ and $T^{s_{h+1}}_{s_h,a}(t)$ the number of visits to $s_h$ (respectively $(s_h,a)$ and $(s_h,a,s_{h+1})$) at depth $h$ after $t-1$ MCTS trajectories. 

A generic UCT-like algorithm depends on a sequence of bonus functions $B(t,s_h,a) $ for each depth $h$. It sequentially plays trajectories from a starting state $s_0$ until some leaf of the search tree or some maximal depth $H$ is reached (nodes at this depth are also called leaves). At the leaves, the playout policy $\pi_0$ is used to provide a (possibly random) evaluation of the value $V_0$. We assume that repeated calls to the playout policy in a state $s$ provide i.i.d. samples from a distribution with mean $V_{0}(s)$. If the playout is not stochastic but provided by a neural network, then the distribution is a Dirac delta centered at $V_0(s)$. The playout could also be fully stochastic (i.e., outputting a sum of discounted rewards under $\pi_0$ starting from $s$) or a mix of both. 

The $t$-th trajectory collected is 
\[\{s_0^{t}=s_0,a_0,r_0,s_1,a_1,r_1, s_2,a_2,r_2, \dots, s_{\ell_t},\widetilde{V}(s_{\ell_t})\},\]
where $\ell_t \leq H$ is the length of the trajectory and $\widetilde{V}(s_{\ell_t})$ is the value of the playout performed in $s_{\ell_t}$. For each $h \leq \ell_t$: 
\[a_h = \underset{a \in \cA_{s_h}}{\text{argmax}} \ \left\{\widehat{Q}_{T_{s_h,a}(t)}(s_h,a) + B\left(t,s_h,a\right) \right\},\]
and $s_{h+1} \sim \mathcal{P}(\cdot | s_h,a_h)$. If $s_{\ell_t}$ is a leaf of the search tree with $\ell_t < H$, we add to the search tree the $Q$-nodes $(s_{\ell_t},a)$ for all available actions $a$ in $s_{\ell_t}$. After a trajectory is collected, the visit counts of the states (resp. state-action pairs) in the trajectory are updated, and the corresponding value (resp. Q-value) estimates are recomputed. 

After $t$ trajectories, the guess $\widehat{a}_t$ will be 
\begin{flalign}
 \widehat{a}_t = \argmax_{a \in \cA_{s_0}} \widehat{Q}_{T_{s_0,a}(t)}(s_0,a), \nonumber
\end{flalign}
and the estimate of the value of the root will be $\widehat{V}_t(s_0)$, where $\widehat{V}_t$ is a value operator to be specified.
\subsection{Stochastic Power-UCT} 
A UCT-like algorithm is fully characterized by: 
\begin{itemize}
 \item the definition of value and Q-value estimates
 \item the choice of the bonus function $B(t,s_h,a) $
 \item the maximal depth $H$ and playout policy
\end{itemize}
In the vanilla UCT algorithm \citep{kocsis2006bandit}, $B(t,s_h,a)  = C\sqrt{\frac{\log(T_{s_h}(t))}{T_{s_h,a}(t)}}$, where $C$ is an exploration constant. This is the bonus used by the UCB algorithm to select actions in stochastic bandits. Other bonuses have been used in practice too \citep{browne2012survey}, and we know since the work of \citet{shah2022journal} that logarithmic bonuses are not sufficient to prove convergence. 
In our Stochastic-Power-UCT algorithm, we define the sequence of bonus functions as
\[
B(t,s_h,a)  = C\frac{T_{s_{h}}(t)^{\frac{b_{h+1}}{\beta_{h+1}}}}{T_{s_{h},a}(t)^{\frac{\alpha_{h+1}}{\beta_{h+1}}}}, h =0,1,\dots,H-1,
\]
where along the tree from depth $0$ to depth $H$ we maintain $\{b_{i}\}^H_{i=0}$, $\{\alpha_{i}\}^H_{i=0}$, $\{\beta_{i}\}^H_{i=0}$ as algorithmic constants that satisfy the conditions in Table~\ref{algorithmic_constants}, and a division by zero is taken to be $+\infty$. 
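
To see how this bonus behaves, consider two illustrative regimes. This is only a sketch: it uses the positivity of the constants and the condition $b_{h+1} < \alpha_{h+1}$, and the visit fraction $c$ and the count $m$ below are hypothetical quantities introduced for illustration. If an action is selected a constant fraction of the time, that is $T_{s_h,a}(t) = c\, T_{s_h}(t)$ with $c \in (0,1]$, then
\[
B(t,s_h,a) = C\, c^{-\frac{\alpha_{h+1}}{\beta_{h+1}}}\, T_{s_h}(t)^{\frac{b_{h+1} - \alpha_{h+1}}{\beta_{h+1}}} \longrightarrow 0 \quad \text{as } T_{s_h}(t) \to \infty,
\]
so the bonus of a frequently played action vanishes. Conversely, if the action count stays fixed at $T_{s_h,a}(t) = m \geq 1$ while the node count grows, then
\[
B(t,s_h,a) = C\, m^{-\frac{\alpha_{h+1}}{\beta_{h+1}}}\, T_{s_h}(t)^{\frac{b_{h+1}}{\beta_{h+1}}},
\]
which grows polynomially in $T_{s_h}(t)$, in contrast with the $\sqrt{\log T_{s_h}(t)}$ growth of the UCT bonus.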
\begin{table}[htb]
\centering
\caption{Conditions for algorithmic constants. $i \in [0,H].$}
\label{algorithmic_constants}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{|c|}
    \hline
    Algorithmic constants (the rows are combined with AND) \\
    \hline
    $b_{i} < \alpha_{i}$; $b_{i}> 2$.\\
    ($1\leq p \leq 2$; $\alpha_{i} \leq \frac{\beta_{i}}{2}$) OR ($p > 2$; $\alpha_{i} \leq \frac{\beta_{i}}{2}$; $0<\alpha_{i} - \frac{\beta_{i}}{p} < 1$).\\
    $\alpha_{i}\left(1 -\frac{b_{i} }{\alpha_{i}}\right) \leq b_{i} < \alpha_{i}$.\\
    $\alpha_{i} = (b_{i+1}-1)\left(1-\frac{b_{i+1}}{\alpha_{i+1}}\right)$.\\
    $\beta_{i} = (b_{i+1}-1)$.\\
    \hline
\end{tabular}
}
\end{table}

The particular choices for the sequences of constants presented above are motivated by the theoretical analysis described in the next section. We highlight that these choices are the same as in the Fixed-Depth-MCTS algorithm of \citet{shah2020non} for $1\leq p \leq 2$, while for $p >2$ an extra condition $0<\alpha_{i} - \frac{\beta_{i}}{p} < 1$ is needed. Furthermore, the Fixed-Depth-MCTS algorithm \citep{shah2020non} has been studied for deterministic settings, while our method is proposed for stochastic environments with general power mean value estimators.
 
As for the value and Q-value estimates, they are the average of the sum of discounted rewards starting from this state (resp. state-action pair) obtained in all past trajectories going through this state (resp. state-action pair). They can also be computed inductively as follows. 
If $s$ is a leaf of the search tree at depth $h$, $\widehat{V}_{t}(s)$ is the average of $t$ playouts obtained by using the playout policy\footnote{Note that the playout policy will be called several times only in leaves that are at depth $H$.}. For internal nodes, we define inductively, for all $t$, 
\begin{flalign}
 \widehat{V}_{t}(s_h) &=  \left(\sum_{a \in \cA_{s_h}} \frac{T_{s_h,a}(t)}{t}  \left(\widehat{Q}_{T_{s_h,a}(t)}(s_h,a)\right)^{p}\right)^{\frac{1}{p}}, \\
 \widehat{Q}_{t}(s_h,a)  &=  \frac{1}{t}\sum_{i=1}^{t} \left[r^{i}(s_h,a) + \gamma \widehat{V}_{T^{s_{h+1}}_{s_h,a}(i)}(s_{h+1})\right], \label{def:qvalue}
\end{flalign}
where $p \in [1,+\infty)$. Here $r^{i}(s_h,a)$ denotes the $i$-th instantaneous reward collected after visiting $(s_h,a)$ at depth $h$, and $T^{s_{h+1}}_{s_h,a}(i)$ is the number of transitions from $(s_h,a)$ to $s_{h+1}$ after timestep $i$. Details can be found in Algorithm~\ref{stochastic_p_uct}.
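
As a sanity check on the power mean backup, consider a small worked instance (the numbers are ours and purely illustrative). Suppose a node $s_h$ has two actions $a_1, a_2$ with $\widehat{Q}(s_h,a_1) = 1$ and $\widehat{Q}(s_h,a_2) = 3$, each accounting for half of the $t$ visits, so that both weights $T_{s_h,a}(t)/t$ equal $1/2$. Then
\begin{align*}
p = 1: \quad & \widehat{V}_t(s_h) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 3 = 2, \\
p = 2: \quad & \widehat{V}_t(s_h) = \left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 9\right)^{\frac{1}{2}} = \sqrt{5} \approx 2.24, \\
p = 4: \quad & \widehat{V}_t(s_h) = \left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 81\right)^{\frac{1}{4}} = 41^{\frac{1}{4}} \approx 2.53, \\
p \to \infty: \quad & \widehat{V}_t(s_h) \to \max\{1, 3\} = 3.
\end{align*}
Thus $p=1$ recovers the average backup, and for nonnegative Q-value estimates increasing $p$ moves the estimate monotonically from the weighted average towards the maximum.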

\begin{remark} We described above the practical implementation of the MCTS algorithm, for which the maximal depth $H$ is sometimes even set to $+\infty$. For the theoretical analysis, however, the maximal depth $H$ will be crucial, and we will actually analyze a variant of this algorithm that always collects trajectories of length $H$. 
\end{remark}

\section{Theoretical analysis}
Planning in MCTS requires a sequence of decisions along the tree, with each internal node acting as a non-stationary bandit. The empirical mean at these nodes shifts due to the action selection strategy. To address this problem, we first analyze non-stationary multi-armed bandit settings, focusing on the concentration properties of the power-mean backup for each arm compared to the optimal value. We then apply these findings to MCTS.
\subsection{Non-stationary power mean multi-armed bandit}
\label{s:bandit_definition}
We consider a class of non-stationary multi-armed bandit (MAB) problems. Let us consider $K \geq 1$ arms or actions of interest. Let $X_{a,t}$ denote the random reward obtained by playing arm $a \in [K]$ at time step $t$; the reward is bounded in $[0,R]$.
We write $\widehat{\mu}_{a,n} = \frac{1}{n}\sum^n_{t=1}X_{a,t}$ for the average reward collected at arm $a$ after $n$ plays, and let $\mu_{a,n}=\E[\widehat{\mu}_{a,n}]$. We use the following notion of concentration.
\begin{definition}\label{def:concentration_main} A sequence of estimators $(\widehat{V}_n)_{n\geq 1}$ concentrates at rate $\alpha,\beta$ towards some limit $V$ under certain conditions on $\alpha,\beta$ if there exists a constant $c>0$ such that the following property holds: 
\[
 \forall n\geq 1, \forall \varepsilon > n^{-\frac{\alpha}{\beta}}, \quad \bP\left(|\widehat{V}_n - V| > \varepsilon\right) \leq c n^{-\alpha}\varepsilon^{-\beta}.
\]
We write $\widehat{V}_n \cv{\alpha,\beta} V$.
\end{definition}
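
As a simple illustration of Definition~\ref{def:concentration_main} (a sketch under an added assumption that is not used in the rest of the analysis), suppose that $\widehat{V}_n$ is the empirical mean of $n$ i.i.d.\ random variables with mean $V$ and finite variance $\sigma^2 > 0$. Chebyshev's inequality then gives, for all $n \geq 1$ and all $\varepsilon > 0$,
\[
\bP\left(|\widehat{V}_n - V| > \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^{2}} = \sigma^2 n^{-1}\varepsilon^{-2},
\]
so that $\widehat{V}_n \cv{1,2} V$ with constant $c = \sigma^2$. In this example $\alpha = \frac{\beta}{2}$, which is the boundary case of the condition $\alpha \leq \frac{\beta}{2}$ appearing later in the analysis.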

We assume that the reward sequence $\{X_{a,t}\}$ is a non-stationary process satisfying the following assumption:
\begin{manualassumption}{1}\label{assumpt_1_main}
Consider $K$ arms. For each $a \in [K]$, let $(\widehat{\mu}_{a,n})_{n\geq 1}$ be a sequence of estimators satisfying
\[
\widehat{\mu}_{a,n}\cv{\alpha,\beta} \mu_a.
\]
\end{manualassumption}
Let us define $\mu_{\star} = \max_{a\in[K]}\{\mu_a\}$. We assume that the optimal arm is unique, so that there is a strict gap between the optimal value $\mu_{\star}$ and the second-best value.
Under Assumption~\ref{assumpt_1_main}, we consider the following optimistic action selection strategy, based on the estimator $\widehat{\mu}_{a,n}$ and using a bonus similar to the one in Stochastic-Power-UCT. More precisely, the algorithm starts by selecting each arm once. Then, given $2 < b < \alpha$ and $\beta > 0$, at each time step $n > K$, the selected action is 
\begin{flalign}
    a_n = \argmax_{a \in \{1,\dots,K\}} \bigg \{\widehat{\mu}_{a,T_a(n)} + C\frac{n^{\frac{b}{\beta}}}{T_a(n)^{\frac{\alpha}{\beta}}} \bigg\},\label{action_select}
\end{flalign}
where $T_a(n) = \sum_{t=1}^{n-1}\ind(a_t = a)$ denotes the number of selections of arm $a$ prior to round $n$. Given a constant $1 \leq p < \infty$, we define \[\widehat{\mu}_n(p) = \left(\sum_{a=1}^{K} \frac{T_a(n)}{n}\widehat{\mu}_{a,T_a(n)}^{p}\right)^{\frac{1}{p}}\] as the power mean value backup operator.
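
To illustrate how $p$ moves the power mean between the arithmetic mean and the maximum, consider a hypothetical instance (the numbers are chosen only for illustration) with $K=2$ arms, weights $\frac{T_1(n)}{n} = \frac{T_2(n)}{n} = \frac{1}{2}$ (ignoring the negligible difference between $n-1$ and $n$), and estimates $\widehat{\mu}_{1,T_1(n)} = 0.2$, $\widehat{\mu}_{2,T_2(n)} = 0.8$. Then
\[
\widehat{\mu}_n(1) = \tfrac{1}{2}(0.2) + \tfrac{1}{2}(0.8) = 0.5, \qquad
\widehat{\mu}_n(2) = \left(\tfrac{1}{2}(0.04) + \tfrac{1}{2}(0.64)\right)^{\frac{1}{2}} = \sqrt{0.34} \approx 0.583, \qquad
\lim_{p \to \infty}\widehat{\mu}_n(p) = 0.8 .
\]
Larger values of $p$ thus weight the backup towards the best-looking arm, while $p=1$ recovers the usual average.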

We establish the concentration properties of the power mean backup operator $\widehat{\mu}_n(p)$ towards the mean value of the optimal arm $\mu_\star$, as shown in Theorem~\ref{thm:theorem1_main}.

\begin{manualtheorem}{1}\label{thm:theorem1_main} For $a \in [K]$, let $(\widehat{\mu}_{a,n})_{n\geq 1}$ be a sequence of estimators satisfying $\widehat{\mu}_{a,n}\cv{\alpha,\beta} \mu_a$ and let $\mu_\star = \max_{a} \{ \mu_a\}$. Assume that the arms are sampled according to the strategy \eqref{action_select} with parameters $\alpha,\beta, b$ and $C$. Assume that $p,\alpha,\beta$ and $b$ satisfy one of these two conditions: 
\begin{enumerate}
    \item[(i)] $1\leq p \leq 2$ and $\alpha \leq \frac{\beta}{2}$ 
    \item[(ii)] $p > 2$ and $0 < \alpha - \frac{\beta}{p} < 1$
\end{enumerate}
If $\alpha\left(1 -\frac{b }{\alpha}\right) \leq b < \alpha$ then the sequence of estimators 
$\widehat{\mu}_n(p)$
satisfies \[\widehat{\mu}_n(p) \cv{\alpha',\beta'} \mu_\star\] for $\alpha' = (b-1)\left(1-\frac{b}{\alpha}\right)$ and $\beta' = (b-1)$ for some value of the constant $C$ in \eqref{action_select} that depends on $K, b, \alpha,p, \Delta_{\min}$ with $\Delta_{\min} = \min_{a : \mu_a < \mu_\star} (\mu_\star - \mu_a)$. 
\end{manualtheorem}
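
To make the rates in Theorem~\ref{thm:theorem1_main} more concrete, the following short computation (taking the conditions exactly as stated in the theorem) shows how the ratio $\alpha'/\beta'$, which sets the threshold $n^{-\alpha'/\beta'}$ in Definition~\ref{def:concentration_main}, depends on $b$:
\begin{gather*}
\frac{\alpha'}{\beta'} = \frac{(b-1)\left(1-\frac{b}{\alpha}\right)}{b-1} = 1 - \frac{b}{\alpha}, \\
\alpha\left(1-\frac{b}{\alpha}\right) \leq b
\quad\Longleftrightarrow\quad \alpha - b \leq b
\quad\Longleftrightarrow\quad b \geq \frac{\alpha}{2}.
\end{gather*}
Hence $\frac{\alpha'}{\beta'} \leq \frac{1}{2}$, with equality when $b = \frac{\alpha}{2}$.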
Theorem~\ref{thm:theorem1_main} is the bandit-level building block for the tree-level analysis of Stochastic-Power-UCT that follows.
\subsection{Monte-Carlo Tree Search}
Building on the results for the non-stationary multi-armed bandit in the previous section, we now analyze Stochastic-Power-UCT in an MCTS tree, treating each node of the tree as a non-stationary multi-armed bandit problem.

We start with the following lemma, which plays an important role in the analysis of our MCTS algorithm.
\begin{manuallemma}{1}\label{lem:concQ_main} For $m \in [M]$, let $(\widehat{V}_{m,n})_{n\geq 1}$ be a sequence of estimators satisfying $\widehat{V}_{m,n}\cv{\alpha,\beta} V_m$, and suppose there exists a constant $L$ such that $\widehat{V}_{m,n} \leq L, \forall n \geq 1$. Let $(X_i)_{i \geq 1}$ be an i.i.d.\ sequence with mean $\mu$ and $(S_i)_{i \geq 1}$ be an i.i.d.\ sequence from a distribution $p=(p_1,\dots,p_M)$ supported on $\{1,\dots,M\}$. Introducing the random variables $N_m^{n} = |\{ i \leq n : S_i = m\}|$, we define the sequence of estimators 
\[\widehat{Q}_n = \frac{1}{n}\sum_{i=1}^{n} X_i + \gamma \sum_{m=1}^{M} \frac{N_m^{n}}{n}\widehat{V}_{m,N_m^{n}}.\]
Then, for $2\alpha \leq \beta$ and $\beta > 1$, 
\[\widehat{Q}_n \cv{\alpha,\beta} \mu + \gamma \sum_{m=1}^{M} p_m V_m.\]
\end{manuallemma}

The proof of Lemma~\ref{lem:concQ_main} can be found in the Appendix. This result is important because it shows that the Q-value estimates at a given depth $h$ concentrate at the same rate $(\alpha,\beta)$ as the value estimates of the children nodes. Then, thanks to Theorem~\ref{thm:theorem1_main}, the value estimate at depth $h$, which is computed using a power mean, concentrates at a different rate $(\alpha',\beta')$. Proceeding by induction from depth $H$ to depth $0$ allows us to derive Theorem~\ref{theorem2_main}, which shows the polynomial concentration of the values and Q-values at the root node. 
We note that this part of the analysis is fairly similar to that of \cite{shah2022journal}. However, its two main ingredients required some innovation. Indeed, Theorem~\ref{thm:theorem1_main} is specific to our power-mean value backup operator, while Lemma~\ref{lem:concQ_main} is specific to the concentration of Q-values in stochastic MDPs.

\begin{manualtheorem}{2}\label{theorem2_main}
When we apply the Stochastic-Power-UCT algorithm, with $\{b_{i}\}^H_{i=0}$, $\{\alpha_{i}\}^H_{i=0}$, $\{\beta_{i}\}^H_{i=0}$ as {algorithmic constants} satisfying the conditions in Table~\ref{algorithmic_constants}, we have
\begin{enumerate}
\item For any node $s_h$ at depth $h$ in the tree ($h \in \{0,1,\dots,H\}$),
\[\widehat{V}_{n}(s_{h}) \cv{\alpha_{h},\beta_{h}} \widetilde{V}(s_{h}). \label{one}
\]
\item For any node $s_{h}$ at depth $h$ in the tree ($h \in \{0,\dots,H-1\}$), 
\[\widehat{Q}_{n}(s_{h},a) \cv{\alpha_{h+1},\beta_{h+1}} \widetilde{Q}(s_{h},a), \text{ for all } a \in \mathcal{A}_{s_{h}}. \label{two}
\]
\end{enumerate}
\end{manualtheorem}
\begin{proof}
We prove the theorem by induction on the depth $H$ of the tree. \\
\underline{Initial step $H=1$}.\\
The state at the root node is $s_0$. Let $r^{t}(s_0,a_k)$ denote the intermediate reward at time step $t$, collected after visiting $(s_0,a_k)$ and transitioning to state $s_1 \sim \mathcal{P}(\cdot|s_0,a_k)$.
Let $r(s_0, a_k)$ denote the mean reward of $(s_0,a_k)$. 
We recall the definition of $\widetilde{Q}(s_0, a_k)$,
\[
\widetilde{Q}(s_0, a_k) = r(s_0, a_k) + \gamma \sum_{s_1\in \mathcal{A}_{s_0}} \mathcal{P}(s_1|s_0,a_k) \widetilde{V}(s_1) 
\]
where $\widetilde{V}(s_1)$ is the average value of the rollout policy $\pi_0$ at state $s_1$, $\mathcal{A}_{s_0}$ is the set of feasible actions at state $s_0$, $|\mathcal{A}_{s_0}| = M$, and $\mathcal{P}(s_1|s_0,a_k)$ is the probability of transitioning to state $s_1$ when taking action $a_k$ at state $s_0$. 

Item~\ref{one} holds for any state $s_1 \sim \mathcal{P}(\cdot|s_0,a_k)$, that is,
\begin{flalign}
\widehat{V}_{n}(s_{1}) \cv{\alpha_{1},\beta_{1}} \widetilde{V}(s_{1}), \label{value_s_1}
\end{flalign}
because each value $\widehat{V}_{n}(s_{1})$ at a leaf node is the average of i.i.d.\ calls to the playout policy $\pi_{0}$.

From \eqref{def:qvalue}, we have
\begin{flalign}
\widehat{Q}_{n}(s_0,a_k)  &=  \frac{1}{n}\sum_{t=1}^{n} \left[r^{t}(s_0,a_k) + \gamma \widehat{V}_{T^{s_{1}}_{s_0,a_k}(t)}(s_{1})\right]\label{def:qvalue_s0}.
\end{flalign}

We apply Lemma~\ref{lem:concQ_main}, where $X_t$ is the intermediate reward $r^{t}(s_0,a_k)$ at time $t$ and $p=(p_1,p_2,\dots,p_M)$ is the transition distribution obtained by taking action $a_k$ at state $s_0$. 
For $m \in [M]$, each sequence $(\widehat{V}_{m,n})_{n\geq 1}$ satisfies 
\begin{flalign}
\widehat{V}_{m,n}(s_1) \cv{\alpha_{1},\beta_{1}} \widetilde{V}(s_1), \text{ with } s_1 \in \{s_m\}, m = 1,2,\dots,M,\nonumber
\end{flalign}
where $s_m \sim \mathcal{P}(\cdot|s_0,a_k)$. We therefore have
\begin{flalign}
\widehat{Q}_{n}(s_{0},a) \cv{\alpha_{1},\beta_{1}} \widetilde{Q}(s_{0},a), \text{ for all } a \in \mathcal{A}_{s_{0}}.\label{two_depth_0}
\end{flalign}
Therefore, at the root node $s_0$, applying Theorem~\ref{thm:theorem1_main} together with (\ref{two_depth_0}), and because 
\begin{flalign}
\widehat{V}_{n}(s_0) =  \left(\sum_{a \in \cA_{s_0}} \frac{T_{s_0,a}(n)}{n}  \left(\widehat{Q}_{T_{s_0,a}(n)}(s_0,a)\right)^{p}\right)^{\frac{1}{p}},\label{def:vvalue_depth_0}
\end{flalign}
with $p \in [1,+\infty)$, we have 
\begin{flalign}
\widehat{V}_{n}(s_{0}) \cv{\alpha_{0},\beta_{0}} \widetilde{V}(s_{0}),\label{value_s_0}
\end{flalign}
where $\alpha_0, \beta_0$ satisfy the conditions in Table~\ref{algorithmic_constants}.
\textit{From (\ref{value_s_1}) and (\ref{value_s_0}), we conclude that item~\ref{one} holds when the depth of the tree is 1.}

Item~\ref{two} holds by (\ref{two_depth_0}).

\ul{Let us assume that the theorem holds for any tree of depth $H-1$.}

\ul{Now let us consider a tree of depth $H$.}

When we take an action $a_k$ at the root state $s_0$ and reach state $s_{1} \sim \mathcal{P}(\cdot|s_0, a_k)$, we enter a subtree of depth $H-1$. By the induction hypothesis, in the subtree rooted at $s_1$ we have, for $h \in \{1,\dots,H\}$,
\begin{flalign}
\widehat{V}_{n}(s_{h}) \cv{\alpha_{h},\beta_{h}} \widetilde{V}(s_{h}),\label{value_anynode_h}
\end{flalign}
and, for $h \in \{1,\dots,H-1\}$,
\begin{flalign}
\widehat{Q}_{n}(s_{h},a) \cv{\alpha_{h+1},\beta_{h+1}} \widetilde{Q}(s_{h},a), \text{ for all } a \in \mathcal{A}_{s_{h}}.\label{qvalue_anynode_h}
\end{flalign}
We now consider the root node at state $s_0$. 

We again apply Lemma~\ref{lem:concQ_main}, where $X_t$ is the intermediate reward $r^{t}(s_0,a_k)$ at time $t$ and, by (\ref{value_anynode_h}), each sequence $(\widehat{V}_{m,n})_{n\geq 1}$ satisfies
\begin{flalign}
\widehat{V}_{m,n}(s_1) \cv{\alpha_{1},\beta_{1}} \widetilde{V}(s_1), \text{ with } s_1 \in \{s_m\}, m = 1,2,\dots,M,\nonumber
\end{flalign}
where $s_m \sim \mathcal{P}(\cdot|s_0,a_k)$. We therefore have
\begin{flalign}
\widehat{Q}_{n}(s_{0},a) \cv{\alpha_{1},\beta_{1}} \widetilde{Q}(s_{0},a), \text{ for all } a \in \mathcal{A}_{s_{0}}.\label{two_depth_H_1}
\end{flalign}

At the root node $s_0$, we again apply Theorem~\ref{thm:theorem1_main}, using the concentration of the Q-values in (\ref{two_depth_H_1}) and the value backup operator at the root state $s_0$ in (\ref{def:vvalue_depth_0}), to obtain 
\begin{flalign}
\widehat{V}_{n}(s_{0}) \cv{\alpha_{0},\beta_{0}} \widetilde{V}(s_{0}),\label{value_node_s_0}
\end{flalign}
where $\alpha_0, \beta_0$ satisfy the conditions in Table~\ref{algorithmic_constants}.

Combining (\ref{value_anynode_h}) and (\ref{value_node_s_0}) establishes item~\ref{one}.

Combining (\ref{qvalue_anynode_h}) and (\ref{two_depth_H_1}) establishes item~\ref{two}.

The results of Theorem~\ref{theorem2_main} therefore hold for every node of a tree of depth $H$. By induction, this concludes the proof.
\end{proof}

Finally, we show that the error of the expected value estimate at the root node 
decays polynomially, as stated below.

\begin{manualtheorem} {3 (Convergence of Expected Payoff)}
At the root node $s_{0}$, with the best possible parameter tuning, we have
\begin{flalign}
&\big| \E [\widehat{V}_{n}(s_{0})] - \widetilde{V}(s_{0}) \big| \leq \mathcal{O}(n^{-1/2}). \nonumber
\end{flalign}
\end{manualtheorem}
\begin{proof}
Using the convexity of $f(x) = |x|$ and applying Jensen's inequality, we have
\begin{flalign}
&  \big| \E [\widehat{V}_{n}(s_{0})] - \widetilde{V}(s_{0}) \big| \leq \E \big[ \big| \widehat{V}_{n}(s_{0}) - \widetilde{V}(s_{0}) \big| \big] \nonumber \\
&=  \int^{+\infty}_0 \bP \left( \left| \widehat{V}_{n}(s_{0}) - \widetilde{V}(s_{0}) \right| \geq s \right) ds \nonumber \\
&\leq  \int^{n^{-\frac{\alpha_{0}}{\beta_{0}}}}_0 1 \, ds + \int^{+\infty}_{n^{-\frac{\alpha_{0}}{\beta_{0}}}} c_{0} n^{-\alpha_{0}}s^{-\beta_{0}} ds \nonumber \\
&\leq n^{-\frac{\alpha_{0}}{\beta_{0}}} + c_{0} n^{-\alpha_{0}} \left( \frac{s^{-\beta_{0} + 1}}{-\beta_{0} + 1}  \right)\Big|^{+\infty}_{n^{-\frac{\alpha_{0}}{\beta_{0}}}} \nonumber \\
&= \left( \frac{c_{0}}{\beta_{0} - 1}  +1 \right) n^{-\frac{\alpha_{0}}{\beta_{0}}}. \nonumber
\end{flalign}
Since $\frac{\alpha_{0}}{\beta_{0}} \leq \frac{1}{2}$ (Theorem~\ref{thm:theorem1_main}), the best rate we can obtain is 
\begin{flalign}
&\big| \E [\widehat{V}_{n}(s_{0})] - \widetilde{V}(s_{0}) \big| \leq \mathcal{O}(n^{-1/2}). \nonumber
\end{flalign}
That concludes the proof.
\end{proof}
\begin{remark}\label{remark_2}
These results show that Stochastic-Power-UCT and Fixed-Depth-MCTS share the same convergence rate for value estimation at the root node, namely $\mathcal{O}(n^{-1/2})$. By selecting algorithmic constants from Table~\ref{algorithmic_constants} such that $\frac{\alpha_{i}}{\beta_{i}} = 1/2$ and $\frac{b_{i}}{\beta_{i}} = 1/4$ for $i\in[0,H]$, we achieve this best rate. This choice leads us to adopt the exploration bonus
\[
B_{h}(n,s,a) = C\frac{n^{1/4}}{T_{s,a}(n)^{1/2}},
\] where $C$ is an exploration constant. Our findings align with those of~\cite{shah2022journal} but apply more broadly to the power mean estimator, of which the arithmetic mean is a special case.
\end{remark}
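
As a quick numerical illustration of the exploration bonus in Remark~\ref{remark_2} (the visit counts below are hypothetical and chosen only to keep the arithmetic simple), take $n = 10^4$ visits to a node, so that $n^{1/4} = 10$. Then
\[
T_{s,a}(n) = 25 \;\Rightarrow\; C\frac{n^{1/4}}{T_{s,a}(n)^{1/2}} = C\,\frac{10}{5} = 2C,
\qquad
T_{s,a}(n) = 400 \;\Rightarrow\; C\frac{n^{1/4}}{T_{s,a}(n)^{1/2}} = C\,\frac{10}{20} = \frac{C}{2}.
\]
The bonus is therefore larger for rarely selected actions and shrinks as an action is selected more often, while growing only polynomially slowly in the total number of visits $n$.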
\section{Experiments}
In this section, we present experimental results demonstrating the numerical advantages of Stochastic-Power-UCT compared to UCT~\citep{kocsis2006improved}, Power-UCT~\citep{dam2019generalized} and Fixed-Depth-MCTS~\citep{shah2022journal} in the SyntheticTree, FrozenLake ($4\times4$), FrozenLake ($8\times8$) and Taxi environments. The discount factor is set to $\gamma=1$ in SyntheticTree and to $\gamma=0.99$ in the FrozenLake and Taxi environments. Hyperparameters can be found in the Appendix.

Following Remark~\ref{remark_2}, the exploration bonus is chosen as $C\frac{n^{1/4}}{T_{s,a}(n)^{1/2}}$ in all environments, where $C$ is an exploration constant. In SyntheticTree, we run further experiments with an adaptive choice of the parameters $\alpha_i, \beta_i, b_i$ for $i \in [0,H]$ satisfying the conditions in Table~\ref{algorithmic_constants} to confirm the theoretical study.\\\\
\textbf{Synthetic Tree}

\begin{figure*}
    \centering
    \subfloat[Comparison of Stochastic-Power-UCT with different values of $p$: $p=1.0$ (Fixed-Depth-MCTS), $2.0$, $4.0$, $6.0$, $8.0$, $10.0$, $16.0$. The exploration bonus is chosen as $C\frac{n^{1/4}}{T_{s,a}(n)^{1/2}}$, where $C$ is an exploration constant.]{\includegraphics[width=0.95\textwidth]{images/synthetic_tree_power_stochastic.jpg}\label{F:stochastic_power_uct}}
    \\    
    \subfloat[Comparison of UCT, Power-UCT, Fixed-Depth-MCTS and Stochastic-Power-UCT. The exploration bonus is chosen as $C\frac{n^{1/4}}{T_{s,a}(n)^{1/2}}$, where $C$ is an exploration constant.]{\includegraphics[width=0.8\textwidth]{images/synthetic_tree.jpg}\label{F:all}} 
    \\
    \subfloat[Stochastic-Power-UCT with the exploration bonus $C\frac{n^{\frac{b_i}{\beta_i}}}{T_{s,a}(n)^{\frac{\alpha_i}{\beta_i}}}$, where the adaptive parameters $\{\alpha_i\}^H_{i=0}, \{\beta_i\}^H_{i=0}, \{b_i\}^H_{i=0}$ and $p$ are chosen to satisfy the conditions in Table~\ref{algorithmic_constants}, for different initial values $\beta_H = 120,130,140$.]{\includegraphics[width=0.8\textwidth]{images/synthetic_tree_adaptive.jpg}\label{F:adaptive}}
\caption{Convergence of the value estimate at the root node to the respective optimal value in the SyntheticTree environment.}
\label{F:synthetic_tree_all_plots}
\end{figure*}

\begin{table*}[!ht]
\caption{Mean and twice the standard deviation of the discounted total reward, over $1000$ evaluation runs, of UCT, Fixed-Depth-MCTS ($p=1$) and Stochastic-Power-UCT ($p=2$ and $p=2.2$) in the FrozenLake ($4\times4$), FrozenLake ($8\times8$), and Taxi environments (in Taxi, we perform 20 evaluation runs). Top row: number of simulations at each time step. Bold denotes no statistically significant difference to the highest mean (t-test, $p < 0.05$).}
\centering
\subfloat[FrozenLake 4x4\label{tab:frozenlake-4x4}]{
\smallskip
\resizebox{2.\columnwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|}\hline
Algorithm & $2048$ & $4096$ & $8192$ & $16384$ &$32768$ & $65536$ & $131072$ & $262144$ \\\cline{1-9}
UCT & $\mathbf{0.10 \pm 0.01}$ & $\mathbf{0.13 \pm 0.01}$ & $\mathbf{0.20 \pm 0.02}$ & $0.27 \pm 0.02$ & $0.37 \pm 0.02$ & $\mathbf{0.43 \pm 0.02}$ & $\mathbf{0.44 \pm 0.02}$ & $\mathbf{0.44 \pm 0.02}$\\\cline{1-9}
$p=1$ & $\mathbf{0.11 \pm 0.01}$ & $0.15 \pm 0.02$ & $\mathbf{0.20 \pm 0.02}$ & $0.29 \pm 0.02$ & $0.35 \pm 0.02$ & $\mathbf{0.41 \pm 0.02}$ & $\mathbf{0.45 \pm 0.02}$ & $\mathbf{0.48 \pm 0.02}$\\\cline{1-9}
$p=2$ & $\mathbf{0.15 \pm 0.02}$ & $\mathbf{0.21 \pm 0.02}$ & $\mathbf{0.31 \pm 0.02}$ & $\mathbf{0.37 \pm 0.02}$ & $\mathbf{0.39 \pm 0.02}$ & $\mathbf{0.44 \pm 0.02}$ & $\mathbf{0.45 \pm 0.02}$ & $\mathbf{0.47 \pm 0.02}$\\\cline{1-9}
$p=2.2$ & $\mathbf{0.16 \pm 0.02}$ & $\mathbf{0.23 \pm 0.02}$ & $\mathbf{0.30 \pm 0.02}$ & $\mathbf{0.37 \pm 0.02}$ & $\mathbf{0.40 \pm 0.02}$ & $\mathbf{0.42 \pm 0.02}$ & $\mathbf{0.45 \pm 0.02}$ & $\mathbf{0.50 \pm 0.02}$\\\cline{1-9}
\end{tabular} 
}
}

\subfloat[FrozenLake 8x8\label{tab:frozenlake-8x8}]{
\smallskip
\resizebox{2.\columnwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|}\hline
Algorithm & $1024$ & $2048$ & $4096$ & $8192$ & $16384$ &$32768$ & $65536$ & $131072$ \\\cline{1-9}
UCT & $\mathbf{0.01 \pm 0.006}$ & $\mathbf{0.02 \pm 0.007}$ & $\mathbf{0.05 \pm 0.01}$ & $0.07 \pm 0.01$ & $0.12 \pm 0.01$ & $\mathbf{0.18 \pm 0.01}$ & $\mathbf{0.22 \pm 0.01}$ & $\mathbf{0.29 \pm 0.01}$\\\cline{1-9}
$p=1$ & $\mathbf{0.02 \pm 0.006}$ & $0.02 \pm 0.008$ & $\mathbf{0.06 \pm 0.001}$ & $0.07 \pm 0.01$ & $0.10 \pm 0.01$ & $\mathbf{0.17 \pm 0.01}$ & $\mathbf{0.23 \pm 0.01}$ & $\mathbf{0.29 \pm 0.01}$\\\cline{1-9}
$p=2$ & $\mathbf{0.02 \pm 0.006}$ & $\mathbf{0.04 \pm 0.09}$ & $\mathbf{0.06 \pm 0.01}$ & $\mathbf{0.09 \pm 0.01}$ & $\mathbf{0.14 \pm 0.01}$ & $\mathbf{0.21 \pm 0.01}$ & $\mathbf{0.25 \pm 0.01}$ & $\mathbf{0.33 \pm 0.01}$\\\cline{1-9}
$p=2.2$ & $\mathbf{0.01 \pm 0.006}$ & $\mathbf{0.04 \pm 0.009}$ & $\mathbf{0.06 \pm 0.01}$ & $\mathbf{0.10 \pm 0.01}$ & $\mathbf{0.12 \pm 0.01}$ & $\mathbf{0.19 \pm 0.01}$ & $\mathbf{0.26 \pm 0.01}$ & $\mathbf{0.31 \pm 0.01}$\\\cline{1-9}
\end{tabular} 
}
}

\subfloat[Taxi\label{tab:taxi}]{
\smallskip
\resizebox{1.6\columnwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|c|}\hline
Algorithm & $512$ & $1024$ & $2048$ & $4096$ & $8192$ & $16384$ \\\cline{1-7}
UCT & $1.03 \pm 0.68$ & $\mathbf{1.20 \pm 0.56}$ & $1.28 \pm 0.54$ & $\mathbf{1.25 \pm 0.69}$ & $\mathbf{1.32 \pm 0.54}$ & $\mathbf{1.66 \pm 0.83}$\\\cline{1-7}
$p=1$ & $\mathbf{0.69 \pm 0.24}$ & $1.11 \pm 0.76$ & $2.22 \pm 1.01$ & $1.63 \pm 0.82$ & $1.52 \pm 0.53$ & $1.96 \pm 1.04$\\\cline{1-7}
$p=2$ & $\mathbf{0.63 \pm 0.36}$ & $0.92 \pm 0.54$ & $\mathbf{1.72 \pm 0.81}$ & $\mathbf{1.49 \pm 0.64}$ & $2.24 \pm 0.83$ & $\mathbf{2.94 \pm 0.95}$\\\cline{1-7}
$p=2.2$ & $\mathbf{0.85 \pm 0.45}$ & $\mathbf{0.76 \pm 0.47}$ & $\mathbf{1.22 \pm 0.68}$ & $1.15 \pm 0.49$ & $\mathbf{2.45 \pm 0.90}$ & $3.07 \pm 0.98$\\\cline{1-7}
\end{tabular}
}
}
\label{tab:frozenlake-taxi-results}
\end{table*}

We evaluate Power-UCT using the synthetic tree toy problem~\cite{dam2021convex}. The problem involves a tree with depth $d$ and branching factor $k$. Each edge of the tree has a random value between $0$ and $1$, and at each leaf, a Gaussian distribution is used as an evaluation function resembling the return of random rollouts. The mean of the Gaussian distribution is the sum of the values assigned to the edges connecting the root node to the leaf, while the standard deviation is set to a constant $\sigma$ (after trying different values, we set $\sigma=0.5$ in our experiments).
To ensure stability, the means are normalized between 0 and 1. We introduce stochasticity into the environment by altering the transition probabilities: there is an $80\%$ chance of moving to the intended node and a $20\%$ chance of moving to a different node with equal probability. We conduct 25 experiments on five trees with five runs each, covering all combinations of branching factors $k=\lbrace2,4,6,8,10,16\rbrace$ and depths $d=\lbrace1,2,3,4\rbrace$. 
We compute the value estimation error at the root node.

Fig.~\ref{F:synthetic_tree_all_plots} shows the convergence of the value estimates at the root node in the SyntheticTree environment under different settings. In detail, Fig.~\ref{F:stochastic_power_uct} shows the performance of Stochastic-Power-UCT with different values of $p$, where we find that $p=2$ outperforms all other values of $p$. In Fig.~\ref{F:all}, we compare Stochastic-Power-UCT with UCT, Power-UCT, and Fixed-Depth-MCTS, and again find that $p=2$ works best. In Fig.~\ref{F:stochastic_power_uct} and Fig.~\ref{F:all}, we choose the exploration bonus $C\frac{n^{1/4}}{T_{s,a}(n)^{1/2}}$. In Fig.~\ref{F:adaptive}, we use the exploration bonus $C\frac{n^{b_i/\beta_i}}{T_{s,a}(n)^{\alpha_i/\beta_i}}, i\in [0,H]$, with $\alpha_i, \beta_i, b_i$ satisfying the conditions in Table~\ref{algorithmic_constants}. The convergence results of Stochastic-Power-UCT shown in Fig.~\ref{F:synthetic_tree_all_plots} confirm the theoretical study. \\\\
\textbf{Frozen Lake}

In OpenAI Gym~\citep{brockman2016openai}, the \textit{FrozenLake} problem presents a classic empirical MDP environment. The goal is to guide an agent through an ice-grid world, avoiding unstable spots that lead to water. The environment's stochastic nature adds to the challenge, as the agent moves in the intended direction only one-third of the time, and otherwise in one of two tangential directions. Reaching the target earns a reward of $+1$, while other outcomes yield zero reward. In Table~\ref{tab:frozenlake-4x4} and Table~\ref{tab:frozenlake-8x8}, Stochastic-Power-UCT ($p=2$) outperforms UCT and Fixed-Depth-MCTS ($p=1$) with $2^{14},2^{15}$ rollouts in FrozenLake $4\times4$, and with $2^{13},2^{14}$ rollouts in FrozenLake $8\times8$. In most cases, Stochastic-Power-UCT ($p=2$) attains a higher mean than the other methods.
\\\\
\textbf{Taxi}
In the \textit{Taxi} environment \cite{chainenv}, agents navigate a $7\times6$ grid from the top left to the top right, encountering walls that block movement. Simply reaching the end yields no reward; the agent must collect three passengers scattered across the grid before reaching the target position. Rewards vary based on the number of passengers collected and delivered successfully. We introduce stochasticity by setting a 50\% chance of moving in the intended direction and a 50\% chance of moving in one of the other directions with equal probability. This stochastic environment necessitates thorough exploration.

As shown in Table~\ref{tab:taxi}, as the number of rollouts increases, Stochastic-Power-UCT ($p=2$) and Stochastic-Power-UCT ($p=2.2$) outperform UCT and Fixed-Depth-MCTS ($p=1$) with $2^{13},2^{14}$ rollouts.
\begin{manualremark}{3}
In our experiments, we find that Stochastic-Power-UCT with $p=2$ consistently outperforms UCT and Fixed-Depth-MCTS ($p=1$), as well as Stochastic-Power-UCT with other values of $p$.
\end{manualremark}
\section{Conclusion}
Monte Carlo tree search (MCTS) is emerging as an effective approach with many applications in games, autonomous car driving, robot path planning, and robot assembly tasks. However, understanding of the theoretical foundations of MCTS remains limited. In this work, we introduce Stochastic-Power-UCT, which uses the power mean as the value estimator together with a polynomial exploration bonus and is specifically designed for stochastic MDPs. Our contribution includes a thorough theoretical study establishing a convergence rate of $\mathcal{O}(n^{-1/2})$ for value estimation at the root node of Stochastic-Power-UCT.
Moreover, we validate our theoretical findings empirically in SyntheticTree and in various stochastic MDP environments, confirming the theoretical claims of our approach. Our work is one more step toward improving the theoretical understanding and practical applicability of MCTS in stochastic environments. One possible extension is to study Power-UCT in adversarial settings. Furthermore, combining learning in reinforcement learning with planning in MCTS could be promising for applications in robotics.


\textbf{Acknowledgments}\\\\
This work has been supported by the French Ministry of Higher Education and Research, the Hauts-de-France region, Inria, the MEL, the I-Site ULNE regarding project RPILOTE-19-004-APPRENF, the Inria A.Ex. SR4SG project, and the Inria-Kyoto University Associate Team ``RELIANT''.

\bibliography{refs}

\input{appendix}
\end{document}
