
A Markov Decision Process (MDP) \citep{bellman1957markovian,puterman2014markov} is defined by a tuple $\langle\gS, \gA, P, R, \gamma\rangle$, where $\gS$ is a state space, $\gA$ is an action space, $P: \gS \times \gA \rightarrow \gS$ is a transition function, $R: \gS \rightarrow \mathbb{R}$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor. 
We consider a discrete action space and deterministic transitions. The goal is to find a policy $\pi: \gS \times \gA \rightarrow [0, 1]$ that maximizes the expected discounted cumulative reward. In this paper, we consider a factored action space $\gA = \gA^1 \times \cdots \times \gA^n$, where each action $A$ is composed of sub-actions $A^i$, i.e., $A=[A^1, \cdots, A^n]$, and each $A^i$ takes values in a set $\gA^i$. We will sometimes refer to sub-actions as action variables. Throughout the paper, we use uppercase and lowercase letters to denote random variables and their assignments, respectively.
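
As a brief illustration of how quickly a factored action space grows (the symbol $k$ is introduced here only for this example), the number of joint actions is the product of the sizes of the sub-action sets:
\begin{equation*}
|\gA| = \prod_{i=1}^{n} |\gA^i|, \qquad \text{so that } |\gA^i| = k \text{ for all } i \text{ gives } |\gA| = k^n .
\end{equation*}
For example, $n = 4$ sub-actions with $k = 5$ values each already yield $5^4 = 625$ joint actions.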

\subsection{Monte Carlo Tree Search}
\label{sec:preliminary_mcts}

Monte Carlo Tree Search (MCTS) \citep{browne2012survey,coulom2006efficient} incrementally builds a search tree to find the best decision from a given state. The algorithm's strength lies in its balance between exploring under-explored actions and exploiting actions with high estimated rewards.
Typically, MCTS repeatedly runs simulations, each consisting of four main stages: \textit{selection}, \textit{expansion}, \textit{evaluation}, and \textit{backup}. Beginning at the root node of the search tree, which represents the current state, each simulation traverses the tree by successively selecting the most promising child node until a leaf node is reached.


In our work, we consider the variant of MCTS proposed in MuZero \citep{schrittwieser2020mastering}, which combines policy and value networks with a latent dynamics model to guide tree search effectively when the true environment model is unknown and the agent receives high-dimensional observations (e.g., pixels). We now briefly describe each model component and the search procedure of MuZero.

\paragrapht{Model components.}
The encoder $f$ embeds each state $s_t$ into a latent space, i.e., $z_t=f(s_t)$. Here, the state could be a high-dimensional observation, such as an image. The latent dynamics model $g$ maps the latent state $z_{t}$ and an action $a_t$ to the next latent state, i.e., $\hat{z}_{t+1}=g(z_{t}, a_{t})$. At each time step $t$, MuZero builds a search tree with $z_t$ as the root node and recursively selects an action at each node.

\paragrapht{Selection.}
At each node $z$, the action is selected as
\begin{equation}
\label{eq:puct}
\hat{a} = \argmax_a \bigg[ Q(z, a) + c \cdot \pi_\theta(z, a) \frac{\sqrt{\sum_b N(z, b)}}{1 + N(z, a)} \bigg],
\end{equation}
where the estimated Q-value, policy prior, and visit count are denoted as \(Q(z, a)\), \(\pi_\theta(z, a)\), and \(N(z, a)\), respectively. 
Here, \(c = c_1+\log \left(\frac{\sum_b N(z, b)+c_2+1}{c_2}\right)\) is an exploration coefficient where $c_1$ and $c_2$ are hyperparameters and $\sum_b N(z, b)$ represents the total number of visits for all actions from state \(z\).
The learnable policy prior \(\pi_\theta(z, a)\) guides the search towards promising actions.
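
To illustrate Eq.~(\ref{eq:puct}) with a small worked example, suppose a node $z$ has two actions $a_1$ and $a_2$ with the hypothetical statistics below, and treat the exploration coefficient as a constant $c = 1.25$. This value is an assumption made for the example; it corresponds to $c \approx c_1$, which holds when $c_2$ is much larger than the total visit count, since the logarithmic term is then close to zero.
\begin{align*}
&Q(z, a_1) = 0.5, \quad \pi_\theta(z, a_1) = 0.4, \quad N(z, a_1) = 3, \\
&Q(z, a_2) = 0.3, \quad \pi_\theta(z, a_2) = 0.6, \quad N(z, a_2) = 1.
\end{align*}
With $\sum_b N(z, b) = 4$, the two scores are
\begin{align*}
a_1:&\quad 0.5 + 1.25 \cdot 0.4 \cdot \frac{\sqrt{4}}{1 + 3} = 0.5 + 0.25 = 0.75, \\
a_2:&\quad 0.3 + 1.25 \cdot 0.6 \cdot \frac{\sqrt{4}}{1 + 1} = 0.3 + 0.75 = 1.05,
\end{align*}
so $\hat{a} = a_2$ is selected even though its estimated Q-value is lower: its higher prior and lower visit count give it a larger exploration bonus.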

\paragrapht{Expansion.}
If there is no child node corresponding to the selected action during the tree traversal, the latent dynamics model predicts the subsequent latent state and adds it to the search tree as a child node of the current node.

\paragrapht{Evaluation.}
After the search tree is expanded, a reward and a value are predicted from the expanded node. A policy prior is also estimated for later use in the \textit{selection} stage.

\paragrapht{Backup.} At the end of a simulation, the visit counts and Q-value estimates of the nodes selected along the path from the root to the expanded node are updated. Each $Q(z, a)$ is updated based on the value of the expanded node and the cumulative rewards along the path to the expanded node.
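
One common way to realize such an update, given here as a sketch rather than the exact rule of any particular implementation, is a running average of discounted returns. The notation is introduced only for illustration: the simulation path is $z^0, a^1, z^1, \ldots, a^l, z^l$, where $r^k$ denotes the reward predicted at $z^k$ and $v^l$ the value predicted at the expanded node $z^l$. For $k = l, \ldots, 1$,
\begin{align*}
G^k &= \sum_{\tau=0}^{l-k} \gamma^{\tau} r^{k+\tau} + \gamma^{l-k+1} v^l, \\
Q(z^{k-1}, a^k) &\leftarrow \frac{N(z^{k-1}, a^k)\, Q(z^{k-1}, a^k) + G^k}{N(z^{k-1}, a^k) + 1}, \\
N(z^{k-1}, a^k) &\leftarrow N(z^{k-1}, a^k) + 1 .
\end{align*}
Under this sketch, the return satisfies the recursion $G^k = r^k + \gamma G^{k+1}$ with $G^l = r^l + \gamma v^l$, so a single pass from the expanded node back to the root suffices.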

\paragrapht{Training.}
The latent dynamics model, encoder, reward, policy, and value networks are jointly trained to predict the policy, value, and reward targets.
Specifically, the model encodes the state $s_t$ into $z_t$ and unrolls the dynamics model to construct $\hat{z}_{t+1}, \cdots, \hat{z}_{t+k}$. It then predicts the policy, value, and reward on each $\hat{z}_{t+i}$.
These predictions are supervised to match the bootstrapped value, the reward, and the visit count distribution, i.e., the normalized number of visits for each action obtained from MCTS at the states $s_{t+1}, \ldots, s_{t+k}$.
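
The policy target described above can be written explicitly. As a sketch, with $\pi^{\mathrm{MCTS}}$ as notation introduced here for illustration, the visit count distribution at the root $z$ of a completed search is
\begin{equation*}
\pi^{\mathrm{MCTS}}(a \mid z) = \frac{N(z, a)}{\sum_b N(z, b)},
\end{equation*}
which is nonnegative and sums to one over the actions. For instance, root visit counts of $(30, 15, 5)$ over three actions give the target $(0.6, 0.3, 0.1)$.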

\subsection{Context-Specific Independence}
Our main goal is to improve the efficiency of MCTS in environments with a factored action space. The key motivation is that some of the action variables do not influence the transition in the current state. The notion of context-specific independence (CSI) \citep{boutilier2013contextspecific} provides a way to understand such relationships.

\begin{definition}[Context-Specific Independence]
We say $Y$ is \textit{contextually independent} of $W$ given the context $X=x$ if $p(y\mid x, v, w) = p(y\mid x, v)$ holds for all $y\in \gY$ and $(v, w)\in \gV\times\mathcal{W}$ whenever $p(x, v, w)>0$.
This is denoted by $Y \Perp W \mid X=x, V$.
\end{definition}

We are concerned with the CSI relationship between the next state $S'$ and the action variables $A=[A^1, \cdots, A^n]$ in the current state $s$, written as:
\begin{equation*}
S' \Perp A\setminus A_{M} \mid S = s, A_{M}, 
\end{equation*}
where $M = \{j_1, \cdots, j_m\}\subseteq [n]$ and $A_M = [A^{j_1}, \cdots, A^{j_m}]$. Note that this independence is specific to the current state $s$ and does not hold in general. In other words, the sub-actions that influence the state transition may vary across states.
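
Under the deterministic transition function $P$ assumed in this paper, the relationship above has a simple reading, sketched here: in state $s$, any two actions that have positive probability and agree on the sub-actions indexed by $M$ lead to the same next state,
\begin{equation*}
P(s, a) = P(s, \tilde{a}) \quad \text{for all } a, \tilde{a} \in \gA \text{ with } a_M = \tilde{a}_M .
\end{equation*}
Consequently, at most $\prod_{j \in M} |\gA^{j}|$ distinct next states are reachable from $s$, rather than $\prod_{i=1}^{n} |\gA^{i}|$. As a purely hypothetical example, consider an agent with a movement sub-action $A^1$ and a grasping sub-action $A^2$: in a state where no object is within reach, the grasping sub-action does not affect the next state, i.e., $S' \Perp A^2 \mid S = s, A^1$, whereas in a state where an object is within reach both sub-actions matter.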

Existing approaches to capturing such compositional structure between the state and sub-actions often rely on the \textit{true} environment model \citep{chitnis2021camps}, e.g., by applying conditional independence tests. However, this is impractical for high-dimensional observations, e.g., images, and more importantly, the true model is unavailable in many scenarios, e.g., healthcare.
