
\input{figure/fig_environments}


In this section, we evaluate our method on environments with expansive combinatorial action spaces. Our investigation focuses on (1) whether the proposed MCTS with state-conditioned action abstraction improves sample efficiency over vanilla MuZero (\Cref{fig:aggregate_metrics,fig:experiments,fig:num_simulations}), (2) whether our method successfully captures compositional relationships between the state and sub-actions (\Cref{fig:visualization-action-abstraction,fig:bandit,fig:mask_probs,fig:gradcam}), and (3) how much the action abstraction contributes to the sample efficiency (\Cref{fig:ablation_studies,fig:recon,table:recon_loss}).\footnote{Our code is available at \url{https://github.com/yun-kwak/efficient-mcts}.}


\subsection{Experimental Setup}
Following prior implementations \citep{schrittwieser2020mastering, schrittwieser2021online}, we augment MuZero with the proposed \csi{}. Vanilla MuZero serves as the baseline throughout the experiments.



\paragrapht{Implementation.} We use the abstraction threshold $\tau = 0.01$ for all experiments. For all environments, we train our method and MuZero for $100$k gradient steps and evaluate the performance of each run every $2000$ steps with 32 seeds. All experiments were run on an NVIDIA RTX 3090 GPU using JAX and Haiku. We provide implementation details in \Cref{appendix:implementation}. Additional experimental results are provided in \Cref{appendix:experiments}.


\subsubsection{Environments}

\paragrapht{DoorKey \textnormal{\citep{gym_minigrid}}.} 
We modify the MiniGrid DoorKey environment to introduce a factored action space (\Cref{fig:environments_colored_doorkey}).
The task is to obtain a key, open a door, and ultimately reach a designated goal, following the shortest path. The action space is factorized as $\mathcal{A} = \mathcal{A}_{\text{turn}} \times \mathcal{A}_{\text{forward}} \times \mathcal{A}_{\text{pick}} \times \mathcal{A}_{\text{open}}$, where $\{\gA_{\text{turn}}, \gA_{\text{forward}}\}$ correspond to the movement of the agent, and $\gA_{\text{pick}}$ and $\gA_{\text{open}}$ correspond to the interaction with the key and the door, respectively. The agent receives a reward of $-0.1$ for each step taken. The configuration of the door, key, and wall and the initial position of the agent are randomly initialized at the beginning of each episode. The attributes of the door and the key are also randomly initialized in each episode. We design three settings: \textsc{Easy}, \textsc{Normal}, and \textsc{Hard}, where the number of attributes is 2, 3, and 4, respectively. The action cardinality for each setting is 54, 96, and 150, respectively. Details of DoorKey are provided in \Cref{appendix:environment-doorkey}.
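
As a sanity check on the reported cardinalities, one factorization consistent with all three settings takes $|\mathcal{A}_{\text{turn}}| = 3$, $|\mathcal{A}_{\text{forward}}| = 2$, and $|\mathcal{A}_{\text{pick}}| = |\mathcal{A}_{\text{open}}| = k + 1$ for $k$ attributes (one option per attribute plus a no-op). These per-factor sizes are an assumption made here for illustration only; the exact sub-action sets are those specified in \Cref{appendix:environment-doorkey}. Under this assumption,
\[
|\mathcal{A}| = 3 \cdot 2 \cdot (k+1)^2 =
\begin{cases}
6 \cdot 3^2 = 54, & k = 2 \ (\textsc{Easy}),\\
6 \cdot 4^2 = 96, & k = 3 \ (\textsc{Normal}),\\
6 \cdot 5^2 = 150, & k = 4 \ (\textsc{Hard}),
\end{cases}
\]
so the action space grows quadratically in the number of attributes.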

\paragrapht{Sokoban \textnormal{\citep{SchraderSokoban2018}}.} 
Sokoban is a challenging environment that requires long-horizon planning: the agent must move a box to a designated target location through a series of actions (\Cref{fig:environments_colored_sokoban}). The map topology, goal location, and box attribute are randomly initialized for each episode. The agent receives a reward of $-0.1$ for each step taken and a reward of $10$ for successfully placing the box on the target location. The action space is factorized as
$\gA = \gA_{\text{move}} \times \prod_i \gA_{\text{box}}^{(i)}$,
where each $\gA_{\text{box}}^{(i)}$ represents the manipulation of the corresponding box. Similar to DoorKey, we design three settings: \textsc{Easy}, \textsc{Normal}, and \textsc{Hard}, where the number of box attributes is 2, 3, and 4, respectively. The action cardinality for each setting is 45, 135, and 405, respectively. Details of Sokoban are provided in \Cref{appendix:environment-sokoban}.
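
The reported cardinalities are consistent with, for example, $|\mathcal{A}_{\text{move}}| = 5$ and $|\mathcal{A}_{\text{box}}^{(i)}| = 3$ with one box factor per attribute. These sizes are an illustrative assumption rather than a statement of the exact sub-action sets, which are given in \Cref{appendix:environment-sokoban}. With $k$ attributes, this reading gives
\[
|\mathcal{A}| = 5 \cdot 3^{k} =
\begin{cases}
5 \cdot 9 = 45, & k = 2 \ (\textsc{Easy}),\\
5 \cdot 27 = 135, & k = 3 \ (\textsc{Normal}),\\
5 \cdot 81 = 405, & k = 4 \ (\textsc{Hard}),
\end{cases}
\]
i.e., the action space grows exponentially in the number of attributes.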

\input{figure/fig_aggregate_metrics}



\input{figure/fig_learning_curve}

\subsection{Results}


\paragrapht{Sample efficiency (\Cref{fig:experiments}).}
We measure the episodic returns of our method and MuZero in each environment. While MuZero struggles with vast combinatorial action spaces, our method achieves near-optimal performance in all environments. We observe that the gap between our method and MuZero becomes more pronounced as the setting gets harder, i.e., as the action cardinality increases. In particular, our method successfully solves the \textsc{Hard} settings, where the action cardinality is 150 for DoorKey and 405 for Sokoban. Following the suggestions of \citet{agarwal2021deep}, we also report aggregate scores across all runs on the DoorKey and Sokoban environments in \Cref{fig:aggregate_metrics}. The normalized scores for each environment are shown in \Cref{fig:teaser}, illustrating that our method remains effective under expansive action spaces.

\input{figure/fig_qualitative_analysis}


\paragrapht{Visualization of the state-conditioned action abstraction (\Cref{fig:visualization-action-abstraction}).}
The \csi{} $h$ learns CSI relationships between state and action variables. We visualize the output $h(z) = [p_z^{1}, \cdots, p_z^{n}] \in [0, 1]^n$ (\Cref{eq:h}) on different states in DoorKey. Recall that the action is factorized as $\mathcal{A} = \mathcal{A}_{\text{turn}} \times \mathcal{A}_{\text{forward}} \times \mathcal{A}_{\text{pick}} \times \mathcal{A}_{\text{open}}$ and that the agent always turns first, advances, and then either picks up a key or opens a door. \Cref{fig:visualization-action-abstraction}-(a) shows that the sub-actions corresponding to the key and the door are assigned almost zero probability. This is because the agent has already obtained a key and cannot interact with the door at this moment. In \Cref{fig:visualization-action-abstraction}-(b), the agent is able to pick up the key, and thus a probability close to 1 is assigned to the sub-action corresponding to the key. Since the agent still cannot interact with the door, our method accurately assigns a probability close to 0 to the corresponding sub-action. Similarly, in \Cref{fig:visualization-action-abstraction}-(c), the agent has already obtained a key and opened the door, and consequently the corresponding sub-actions become irrelevant. Finally, in \Cref{fig:visualization-action-abstraction}-(d), the agent is able to interact with the door. Our model assigns probabilities of 0 and 1 to the sub-actions corresponding to the key and the door, respectively.
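
To make the effect of these probabilities concrete, suppose (as an illustrative assumption, with $\bar{\mathcal{A}}(z)$ a notation introduced only for this remark) that sub-action $i$ is treated as irrelevant in state $z$ whenever $p_z^{i} < \tau$ and that actions differing only in irrelevant sub-actions are merged into one abstract action. Writing $\mathcal{A}_i$ for the $i$-th factor of $\mathcal{A}$, the number of abstract actions is then
\[
|\bar{\mathcal{A}}(z)| = \prod_{i \,:\, p_z^{i} \geq \tau} |\mathcal{A}_i| .
\]
For instance, in a state such as \Cref{fig:visualization-action-abstraction}-(a), if both $p_z^{\text{pick}}$ and $p_z^{\text{open}}$ fall below $\tau$, this count reduces from $|\mathcal{A}_{\text{turn}}| \, |\mathcal{A}_{\text{forward}}| \, |\mathcal{A}_{\text{pick}}| \, |\mathcal{A}_{\text{open}}|$ to $|\mathcal{A}_{\text{turn}}| \, |\mathcal{A}_{\text{forward}}|$.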




\input{figure/fig_bandit}

\paragrapht{Evaluation of CSI discovery (\Cref{fig:bandit}).}
We evaluate the performance of the \csi{} using the Structural Hamming Distance (SHD) \citep{acid2003searching,ramsey2006adjacency}, which measures the difference between two directed graphs. A lower SHD indicates that the two graphs are more similar in structure, and an SHD of $0$ means the graphs are identical. As the environments in our main experiments (i.e., DoorKey and Sokoban) do not provide the ground-truth CSIs, we design a contextual multi-armed bandit scenario for the evaluation with SHD. As shown in \Cref{fig:bandit_shd}, our method successfully identifies CSI relationships. Furthermore, \Cref{fig:bandit_learning_curve} illustrates that the episodic return of our method improves as it identifies CSIs more accurately. This demonstrates the importance of capturing compositional structures and of state-conditioned action abstraction for our method. We provide additional details in \Cref{appendix:environment-cmab}.
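
For intuition, the following is a sketch of how SHD specializes when edge directions are fixed, as they are for edges from action variables to the next state. The encoding and the binarization rule are assumptions made here for illustration and are not taken from \Cref{appendix:environment-cmab}. Let $g \in \{0,1\}^n$ encode the ground-truth relevance of each sub-action in a given state, and let $\hat{g}^{i} = \mathbf{1}[p_z^{i} \geq \tau]$ be the binarized prediction. Since no edge can be reversed, SHD counts only missing and extra edges and reduces to a Hamming distance:
\[
\mathrm{SHD}(g, \hat{g}) = \sum_{i=1}^{n} \mathbf{1}\bigl[g^{i} \neq \hat{g}^{i}\bigr] .
\]
For example, with $n = 3$, $g = (1, 0, 1)$, and $\hat{g} = (1, 1, 1)$, the prediction contains one extra edge and $\mathrm{SHD}(g, \hat{g}) = 1$.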

\input{figure/fig_mask_probs}

\paragrapht{\Csi{} $h$ (\Cref{fig:mask_probs}).} 
We further investigate the behavior of the \csi{} by examining how its inference on the compositional relationships changes along the trajectory. Interestingly, when the agent is close to the key but has not picked it up yet ($t=2$), our method (\Cref{eq:h}) infers $p_z^{\text{pick}}=1$ and $p_z^{\text{open}}=0$. This is because the agent cannot open the door at the moment, and thus the corresponding sub-action $A^{\text{open}}$ is irrelevant. The agent then obtains the key and starts to move toward the door ($t=5$). From this moment, the prediction of the \csi{} begins to change. When the agent comes close to the door ($t=7$) and finally opens it ($t=10$), our method predicts $p_z^{\text{pick}}=0$ and $p_z^{\text{open}}=1$. This illustrates that our method effectively captures the compositional relationships between the current state and action variables, which change across states and time steps.

\paragrapht{Ablation study (\Cref{fig:ablation_studies}).} The reconstruction loss is crucial for both our method and MuZero. In addition, the performance gain from using the state-conditioned action abstraction shows that it contributes significantly to the superior sample efficiency of our method. In fact, our method without action abstraction performs similarly to MuZero, indicating that masked latent dynamics modeling alone does not bring any performance gain.



\input{figure/fig_ablation}

\paragrapht{Simulation budgets (\Cref{fig:num_simulations}).}
We compare our method with MuZero in DoorKey across varying numbers of simulations. Our method consistently outperforms MuZero and achieves near-optimal performance under all simulation budgets. These results illustrate the effectiveness of state-conditioned action abstraction guided by the \csi{}, leading to superior sample efficiency and scalability.

\paragrapht{Latent dynamics model (\Cref{table:recon_loss,fig:recon}).}
We examine the latent dynamics model to investigate whether the superior sample efficiency of our method comes from the dynamics model or from the state-conditioned action abstraction.
As shown in \Cref{table:recon_loss}, the latent dynamics model of MuZero achieves a slightly better reconstruction loss. This is because our method restricts the dynamics model to use only a subset of the action variables for prediction.
This indicates that the superior performance of our method stems not from the latent dynamics model but from the state-conditioned action abstraction. We further inspect the reconstructed observations from MuZero and our method.
As shown in \Cref{fig:recon}, both dynamics models reconstruct the observations reasonably well, which further supports attributing the performance gain to the state-conditioned action abstraction.

\input{figure/fig_num_simulations}


\paragrapht{GradCAM (\Cref{fig:gradcam}).}
We visualize the learned \csi{} using GradCAM \citep{Selvaraju2016GradCAMVE} on DoorKey. We use the gradient of $p_z^i$ for the sub-action $A^i$ with respect to the last feature map. We observe that our method attends to the relevant object when deciding whether the corresponding sub-action is relevant. For example, in \Cref{fig:gradcam_key}, the network focuses on the key and the position of the agent to determine the probability assigned to the sub-action $A^{\text{key}}$ corresponding to the key. Similarly, \Cref{fig:gradcam_door} shows that our model focuses on the door to infer the state-conditioned dependency of the sub-action $A^{\text{door}}$ corresponding to the door.