\section{Environmental details}
\label{appendix:environment}

For the DoorKey and Sokoban environments, the top-down view of the map is given as the image observation to the agent.

\subsection{DoorKey}\label{appendix:environment-doorkey}

We modify the MiniGrid DoorKey environment \citep{gym_minigrid} to incorporate a factored action space. The task is to first obtain a key, then open a door, and finally reach a goal position. At the initial state, the door is locked, and the agent is positioned on the opposite side of the goal. A wall with a door embedded in it separates the initial position and the goal. The action space is factorized as $\mathcal{A} = \mathcal{A}_{\text{turn}} \times \mathcal{A}_{\text{forward}} \times \mathcal{A}_{\text{pick}} \times \mathcal{A}_{\text{open}}$, where $\mathcal{A}_{\text{turn}}=\{\text{no-op}, \text{turn left}, \text{turn right}\}$, $\mathcal{A}_{\text{forward}}=\{\text{no-op}, \text{move forward}\}$, $\mathcal{A}_{\text{pick}}=\{\text{no-op}, \text{pick red key}, \text{pick blue key}, \ldots\}$, and $\mathcal{A}_{\text{open}}=\{\text{no-op}, \text{open red door}, \text{open blue door}, \ldots\}$. The ``$\text{no-op}$'' action denotes a choice to perform no operation, i.e., ``do nothing'', for that particular action variable. If the number of colors is 4, the cardinality of the action space is $\lvert \gA \rvert = 3 \times 2 \times 5 \times 5 = 150$. We introduce three difficulty levels (\textsc{Easy}, \textsc{Normal}, \textsc{Hard}) corresponding to two, three, and four colors, with action-space cardinalities of 54, 96, and 150, respectively. Within each step, the sub-actions are applied in a fixed order: the agent first turns, then moves forward, and finally picks up a key and/or opens a door. The horizon $H$ is 1440, and the agent receives a reward of $-0.1$ for each step. The configurations of the door, key, wall, and the initial position of the agent are randomly initialized at the beginning of each episode. We conducted experiments on a $12\times12$ grid for the primary results (\Cref{fig:aggregate_metrics,fig:experiments}) and on an $8\times8$ grid for the supplementary studies (\Cref{fig:visualization-action-abstraction,fig:num_simulations,fig:gradcam}).
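
\noindent As a sanity check on these cardinalities, let $c$ denote the number of colors (a symbol introduced here only for illustration). Since $\mathcal{A}_{\text{pick}}$ and $\mathcal{A}_{\text{open}}$ each contain one no-op plus one sub-action per color, the count can be written as
\[
\lvert \mathcal{A} \rvert
= \lvert \mathcal{A}_{\text{turn}} \rvert \cdot \lvert \mathcal{A}_{\text{forward}} \rvert \cdot \lvert \mathcal{A}_{\text{pick}} \rvert \cdot \lvert \mathcal{A}_{\text{open}} \rvert
= 3 \cdot 2 \cdot (c+1)^{2},
\]
which gives $3 \cdot 2 \cdot 3^{2} = 54$ for $c=2$, $3 \cdot 2 \cdot 4^{2} = 96$ for $c=3$, and $3 \cdot 2 \cdot 5^{2} = 150$ for $c=4$.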

\subsection{Sokoban}\label{appendix:environment-sokoban}

Sokoban \citep{gym_minigrid} is a challenging environment that requires long-horizon planning. Here, the task is to move a box to a designated target location through a series of actions. We modify the environment to incorporate a factored action space and generate image observations using the tiny world rendering mode, which has modest visual complexity. The effect on the box is determined solely by a single box-action variable, namely the one that matches the color of the box. In addition, the map configuration, goal location, and box color are randomly initialized at the beginning of each episode. With a horizon of $H=150$, the agent receives a reward of $-0.1$ for each step taken and a reward of $+10$ for successfully placing the box on the target location. The action space is factorized as $\mathcal{A} = \mathcal{A}_{\text{move}} \times \mathcal{A}_{\text{box}}^{\text{red}} \times \mathcal{A}_{\text{box}}^{\text{blue}} \times \cdots$, where $\mathcal{A}_{\text{move}}=\{\text{no-op}, \text{move up}, \text{move down}, \text{move left}, \text{move right}\}$ and $\mathcal{A}_{\text{box}}^{i}=\{\text{no-op}, \text{push}, \text{pull}\}$ for all $i \in \{\text{red}, \text{blue}, \ldots\}$. If the number of colors is 4, the cardinality of the action space is $\lvert \gA \rvert = 5 \times 3^{4} = 405$. We introduce three difficulty levels (\textsc{Easy}, \textsc{Normal}, \textsc{Hard}) corresponding to two, three, and four colors, with action-space cardinalities of 45, 135, and 405, respectively.
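
\noindent The same count can be written in closed form. With $c$ denoting the number of box colors (a symbol introduced here only for illustration), each color contributes one three-valued box variable, so
\[
\lvert \mathcal{A} \rvert
= \lvert \mathcal{A}_{\text{move}} \rvert \cdot \prod_{i} \lvert \mathcal{A}_{\text{box}}^{i} \rvert
= 5 \cdot 3^{c},
\]
which gives $5 \cdot 3^{2} = 45$ for $c=2$, $5 \cdot 3^{3} = 135$ for $c=3$, and $5 \cdot 3^{4} = 405$ for $c=4$.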

\subsection{Contextual Multi-armed Bandit}\label{appendix:environment-cmab}
Our method aims to identify CSI relationships. To rigorously test its performance in capturing CSIs, we employ a Contextual Multi-Armed Bandit (CMAB) environment. We quantify the similarity between the ground truth CSIs and the identified CSIs with the Structural Hamming Distance (SHD) \citep{acid2003searching}. The CMAB problem introduces contextual information to the classical Multi-Armed Bandit problem, allowing the agent to leverage state-specific cues for action selection and maximize cumulative rewards over time. The synthetic environment features a combinatorial action space \citep{Chen2014CombinatorialMB, Qin2014ContextualCB, Chen2018ContextualCM}, factorized as $\gA=\gA^{1} \times \gA^{2} \times \gA^{3}$ where $|\gA^{i}|=7$. The reward received at each time step equals the current state, i.e., $r_t=s_t$. The horizon is $H=25$, the state space is $\mathbb{N}_{0}$, and the initial state is $s_0=0$. State transitions are determined solely by a single sub-action, and which sub-action is relevant varies across states. The relevant sub-action is identified by the ground truth CSI $M_t=\{i\}$, where $i=\text{mod}(\lfloor \frac{s_t}{6} \rfloor, 3)$, and is thus state-dependent. Formally, the transition is $s_{t+1}=s_{t}+a_{t}^{i}$ for even $s_t$, and $s_{t+1}=s_{t}+ (6 - a_{t}^{i})$ otherwise, where $i$ indicates the index of the relevant action variable for state $s_t$. Identifying the relevant action variable for a given state is essential due to the large combinatorial action space ($|\gA|=7^3=343$).
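
\noindent To make the transition rule concrete, the following worked example evaluates it at a few states. It assumes that sub-actions take values $a_t^{i} \in \{0, \ldots, 6\}$ (consistent with $|\mathcal{A}^{i}|=7$ and the term $6 - a_t^{i}$, but not stated explicitly above) and that the index $i \in \{0, 1, 2\}$ produced by the formula is zero-based.
\[
\begin{array}{lll}
s_t = 4: & i = \text{mod}(\lfloor 4/6 \rfloor, 3) = 0, & s_{t+1} = 4 + a_t^{i} \quad (s_t \text{ even}),\\
s_t = 7: & i = \text{mod}(\lfloor 7/6 \rfloor, 3) = 1, & s_{t+1} = 7 + (6 - a_t^{i}) \quad (s_t \text{ odd}),\\
s_t = 20: & i = \text{mod}(\lfloor 20/6 \rfloor, 3) = 0, & s_{t+1} = 20 + a_t^{i} \quad (s_t \text{ even}).
\end{array}
\]
Under the same assumption, the state grows by at most 6 per step (choosing $a_t^{i}=6$ in even states and $a_t^{i}=0$ in odd states), so $s_t \le 6t$ and in particular $s_H \le 6 \times 25 = 150$.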

\section{Implementation details}
\label{appendix:implementation}

\subsection{Experimental details}
\label{appendix:implementation-experiment}
We employed the Adam optimizer \citep{kingma2015adam} with decoupled weight decay \citep{loshchilov2018decoupled} for training. Each run took approximately 48 hours in both environments. Our method and MuZero were trained for $100$k gradient steps (5 runs for DoorKey, 3 for Sokoban). Results in \Cref{fig:num_simulations} are averaged across 7 runs per simulation budget. We use a uniform replay buffer and collect $32$k transitions under a uniform random policy prior to training. Evaluations were conducted every $2000$ update steps using 32 seeds. Confidence intervals of the reported performance were estimated with the percentile bootstrap at the $95\%$ level. Scores are min-max normalized (from $-150$ to $0$ for DoorKey and from $-15$ to $10.5$ for Sokoban) in \Cref{fig:teaser,fig:aggregate_metrics}, and the average episodic return at $40$k gradient steps is shown in \Cref{fig:teaser}. All experiments were executed on an NVIDIA RTX 3090 GPU using JAX and Haiku. We perform no data augmentation of the state. The $96\times96\times3$ state is scaled as $s/255$. The action is factorized and encoded as a one-hot vector per action variable; these one-hot vectors are then concatenated into a single action vector. Following \citet{schrittwieser2020mastering}, the action is spatially tiled and concatenated with the state as input to the dynamics network. For the CMAB environment, the state is encoded into a $96\times96\times3$ representation by repeating the normalized state, computed as $s / ((\max_{i}{\lvert \gA^i \rvert} - 1) \times H)$. Here, $\max_{i}{\lvert \gA^i \rvert}$ equals 7 and $H$ is set to 25.
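
\noindent For concreteness, the two normalizations above can be written out as follows. The first line assumes the standard min-max form, with $R$ denoting the raw episodic return and $\bar{R}$ the normalized score (both symbols are introduced here only for illustration); the second line instantiates the CMAB state normalization with the stated constants, writing $\tilde{s}$ for the normalized state.
\begin{align*}
\bar{R} &= \frac{R - R_{\min}}{R_{\max} - R_{\min}},
\qquad \bar{R}_{\text{DoorKey}} = \frac{R + 150}{150},
\qquad \bar{R}_{\text{Sokoban}} = \frac{R + 15}{25.5},\\
\tilde{s} &= \frac{s}{(\max_{i} \lvert \mathcal{A}^{i} \rvert - 1) \times H} = \frac{s}{(7 - 1) \times 25} = \frac{s}{150}.
\end{align*}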

\input{table/param_muzero}

\subsection{Implementation of MuZero}
\label{appendix:implementation-muzero}
We mostly follow the architectural design of MuZero from \citet{schrittwieser2020mastering, schrittwieser2021online}. Following \citet{schrittwieser2020mastering}, we use categorical representations for the value and reward predictions. Dirichlet noise is added to the policy prior as follows: $(1-\rho) \pi(a|s)+\rho \mathcal{N}_{\mathcal{D}}(\xi)$, where $\rho=0.25$ and $\mathcal{N}_{\mathcal{D}}(\xi)$ is the Dirichlet noise distribution. We set $\xi=0.0$ during evaluation and for non-root nodes, and $\xi=0.3$ otherwise. Similar to \citet{ye2021mastering}, we reduce the number of residual blocks and channel dimensions due to the high computational cost of the original network architecture used in MuZero. We use a kernel size of $3\times3$ for all operations, unless otherwise specified. The architecture comprises four components: the representation network, the dynamics network, the prediction network, and the reconstruction network.

\noindent The architecture of \textbf{the representation network} is as follows:
\begin{itemize}
    \item 1 convolution with stride 2 and 32 output planes, output resolution $48\times48$. (LayerNorm + ReLU)
    \item 1 residual block with 32 planes.
    \item 1 residual downsample block with stride 2 and 64 output planes, output resolution $24\times24$.
    \item 1 residual block with 64 planes.
    \item Average pooling with stride 2, output resolution $12\times12$. (LayerNorm + ReLU)
    \item 1 residual block with 64 planes.
    \item Average pooling with stride 2, output resolution $6\times6$. (LayerNorm + ReLU)
    \item 1 residual block with 64 planes.
\end{itemize}
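
\noindent Tracing the spatial resolution and plane count through the layers listed above, for the $96\times96\times3$ input, gives the following sketch. It assumes that plain residual blocks preserve both resolution and plane count, which is not stated explicitly in the list.
\begin{align*}
96\times96\times3
&\;\xrightarrow{\text{conv, stride 2}}\; 48\times48\times32
\;\xrightarrow{\text{downsample block, stride 2}}\; 24\times24\times64\\
&\;\xrightarrow{\text{avg.\ pool, stride 2}}\; 12\times12\times64
\;\xrightarrow{\text{avg.\ pool, stride 2}}\; 6\times6\times64.
\end{align*}
Under this assumption, the latent state produced by the representation network has shape $6\times6\times64$.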

\noindent The architecture of \textbf{the dynamics network} is as follows:
\begin{itemize}
    \item Concatenate the input states and input actions.
    \item 1 convolution with stride 2 and 64 output planes. (LayerNorm)
    \item A residual link: add up the output and the input states. (ReLU)
    \item 1 residual block with 64 planes.
\end{itemize}

\noindent The architecture of \textbf{the prediction network for the reward prediction} is as follows:
\begin{itemize}
    \item 1 $1\times1$ convolution with 16 output planes. (LayerNorm + ReLU)
    \item Flatten.
    \item 1 fully connected layer with 32 output dimensions. (LayerNorm + ReLU)
    \item 1 fully connected layer with 601 output dimensions.
\end{itemize}

\noindent The architecture of \textbf{the prediction network for the policy and value prediction} is as follows:
\begin{itemize}
    \item 1 residual block with 64 planes.
    \item 1 $1\times1$ convolution with 16 output planes. (LayerNorm + ReLU)
    \item Flatten.
    \item 1 fully connected layer with 32 output dimensions. (LayerNorm + ReLU)
    \item 1 fully connected layer with $D$ output dimensions,
\end{itemize}
where $D=601$ in the value prediction network and $D=|\mathcal{A}|$ in the policy prediction network. The policy and value prediction networks share the initial residual block.

\noindent The architecture of \textbf{the reconstruction network} is as follows:
\begin{itemize}
    \item 1 residual block with transposed convolution, stride 1, and 64 output planes.
    \item 1 residual block with transposed convolution, stride 2, and 64 output planes.
    \item 1 residual block with transposed convolution, stride 1, and 64 output planes.
    \item 1 residual block with transposed convolution, stride 2, and 64 output planes.
    \item 1 residual block with transposed convolution, stride 1, and 64 output planes.
    \item 1 residual block with transposed convolution, stride 2, and 32 output planes.
    \item 1 residual block with transposed convolution, stride 1, and 32 output planes.
    \item 1 transposed convolution with stride 2 and 3 output dimensions. (LayerNorm + ReLU)
\end{itemize}
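
\noindent As a rough consistency check, the four stride-2 layers in the list above (the second, fourth, and sixth residual blocks and the final transposed convolution) can upsample a $6\times6$ latent state back to the input resolution. The sketch below assumes that the reconstruction network takes the $6\times6\times64$ latent state as input and that each stride-2 transposed convolution exactly doubles the spatial resolution; neither is stated explicitly above.
\[
6\times6\times64
\;\to\; 12\times12\times64
\;\to\; 24\times24\times64
\;\to\; 48\times48\times32
\;\to\; 96\times96\times3.
\]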

\noindent We mostly use the same hyperparameter settings as \citet{ye2021mastering}. Details of the hyperparameters are provided in \Cref{table:param_muzero}.

\subsection{Implementation of Our Method}
\label{appendix:implementation-ours}

We build our method on top of the implementation of MuZero. Our method introduces additional hyperparameters required for action abstraction, i.e., the sparsity regularization coefficient $\lambda$, the Gumbel sigmoid temperature $\delta$, and the abstraction threshold $\tau$.
We use the abstraction threshold $\tau$ to induce on-the-fly action abstraction from the probabilities of state-conditioned dependencies for each action variable. The sparsity coefficient is set to $\lambda = 0$ on the DoorKey and Sokoban environments, and $\lambda = 0.01$ for the CMAB. We set the abstraction threshold and the Gumbel sigmoid temperature to $\tau = 0.01$ and $\delta= 1$, respectively, for all experiments.
For our \csi{}, we use Gumbel-Softmax reparametrization for backpropagation \citep{maddison2016concrete,jang2016categorical}, similar to \citet{hwang2023on}.
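
\noindent As an illustration of how these hyperparameters enter, one common form of the Gumbel-sigmoid (binary Concrete) relaxation is sketched below. The notation is introduced here only for illustration and the exact parametrization in an implementation may differ: $p_i(s)$ denotes the predicted probability that action variable $i$ is relevant in state $s$, $\sigma$ is the logistic sigmoid, and $u_i \sim \mathrm{Uniform}(0,1)$.
\[
\tilde{m}_i(s) = \sigma\!\left( \frac{\log p_i(s) - \log\bigl(1 - p_i(s)\bigr) + \log u_i - \log(1 - u_i)}{\delta} \right).
\]
Assuming the threshold is applied directly to the probabilities, a hard mask for the search procedure could then be obtained as $m_i(s) = \mathbb{1}\left[ p_i(s) > \tau \right]$. With $\tau = 0.01$, a variable would be abstracted away only when its predicted relevance falls below $1\%$.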

\noindent The architecture of \textbf{the auxiliary network for discovering CSIs} is as follows:
\begin{itemize}
    \item 1 residual block with 64 planes.
    \item 1 $1\times1$ convolution with 16 output planes. (LayerNorm + ReLU)
    \item Flatten.
    \item 1 fully connected layer with 32 output dimensions. (LayerNorm + ReLU)
    \item 1 fully connected layer with $D$ output dimensions,
\end{itemize}
where $D$ is the number of action variables and the initial residual block is shared with the policy and value prediction network.

\section{Additional experiments}
\label{appendix:experiments}

\input{figure/appendix/fig_recon}

\input{figure/appendix/fig_gradcam}
\input{figure/appendix/fig_additional_ablation}

In this section, we present additional visualizations (\Cref{fig:appendix_recon,fig:appendix_gradcam}) and ablation results (\Cref{fig:appendix_ablation}) to underscore the benefits of our approach. Experiments in \Cref{fig:appendix_ablation} used a more complex DoorKey environment (five colors, 216 actions vs. four colors, 150 actions in \textsc{Hard} difficulty).
Despite the vast combinatorial action space, our method demonstrates near-optimal performance in \Cref{fig:appendix_ablation}, significantly outperforming vanilla MuZero. Ablations in \Cref{fig:appendix_ablation_recon,fig:appendix_act_abst} confirm that the state-conditioned action abstraction is a vital part of our improvement.

We further investigate the effect of training the \csi{} only with the reconstruction loss so that it faithfully represents the transition dynamics, as described in \Cref{sec:method-inference}. Restricting its updates to the gradients of the reconstruction loss ensures that the \csi{} learns the proper context-specific independence from the transition dynamics. In other words, we freeze the parameters of the \csi{} when learning the policy, value, and reward. \Cref{fig:ablation_frozen_action_mask_fn} shows that updating the \csi{} solely with the reconstruction loss (Frozen) improves stability compared to training it jointly with all losses (Unfrozen).

\paragrapht{Sparsity coefficient $\lambda$.} We report experimental results from an ablation analysis of the sparsity coefficient $\lambda$ in \Cref{fig:appendix_mask_coef}. We evaluated our method with $\lambda$ values of $\{0.0, 0.01, 0.001\}$ on the DoorKey-Easy environment, and the performance of MuZero is also presented for comparison. The results demonstrate that our approach maintains a considerable degree of robustness to variations in $\lambda$. Interestingly, our method with a sparsity coefficient of $\lambda=0.0$ seems to achieve implicit sparsity. Consequently, we employed $\lambda=0.0$ in all primary experiments on the DoorKey and Sokoban environments.

\input{figure/appendix/fig_ablation_mask_coef}

\input{table/space_reduction_and_training_time_comp}
\paragrapht{Search space reduction.} We describe the extent to which the MCTS search space is reduced in \Cref{table:action_space_reduction}. We measured the action space reduction at the root node across the \textsc{Easy}, \textsc{Normal}, and \textsc{Hard} difficulty levels of the DoorKey and Sokoban environments after 100,000 gradient steps. \Cref{table:action_space_reduction} reports the percentage reduction in the search space, demonstrating the efficacy of the proposed action abstraction method. Higher percentages indicate more substantial reductions.
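
\noindent One natural way to express such a percentage is sketched below; it is an assumed definition given for illustration, not necessarily the exact quantity computed for the table. Here $\bar{\mathcal{A}}(s)$ denotes the set of abstract actions remaining at the root state $s$ (notation introduced here).
\[
\text{reduction}(s) = \left( 1 - \frac{\lvert \bar{\mathcal{A}}(s) \rvert}{\lvert \mathcal{A} \rvert} \right) \times 100\%.
\]
As a hypothetical example (not a measured result), if only the movement variable and a single box variable were kept in Sokoban-\textsc{Hard}, then $\lvert \bar{\mathcal{A}}(s) \rvert = 5 \cdot 3 = 15$ out of $\lvert \mathcal{A} \rvert = 405$, which corresponds to a reduction of $(1 - 15/405) \times 100\% \approx 96.3\%$.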


\paragrapht{Training time comparison.}
In addition to sample efficiency, it is crucial to assess a method under constrained computational resources. \Cref{table:training_time} displays the episodic returns obtained by both methods on an NVIDIA RTX 3090 after various training durations (4, 8, 12, 16, and 20 hours) in the DoorKey-Easy environment. For example, after 12 hours of training, our model achieved an episodic return of $-3.89$, compared to MuZero's $-80.60$. The wall-clock time reported for each method includes the time needed to complete 32 evaluation episodes every 2000 update steps, as detailed in \Cref{appendix:implementation-experiment}. The results show that our method is not computationally intensive relative to the baseline. Further discussion on the computational overhead of our method is provided in \Cref{appendix:discussion}.

\section{Additional Discussions}
\label{appendix:discussion}

\paragrapht{Challenges and approaches in identifying CSIs.}
Deriving CSIs from the dynamics model has been a challenging problem. Even if we have access to the dynamics model, the context-specific independences inherent in a factored MDP remain latent and challenging to discern. An approach to approximate these CSIs could be the sample-based testing algorithm introduced in CAMPs \citep{chitnis2021camps}.
However, it falls short in uncovering CSIs within a latent dynamics model and struggles to scale to larger state and action spaces. The algorithm's time complexity further complicates its application, especially in pixel-based environments. Additionally, identifying all independences beforehand is a formidable task. In contrast, our method progressively learns CSIs as training advances.

\paragrapht{Learning and leveraging the action mask.}
While querying the legal actions might be possible if the underlying simulator is provided, it is not feasible when the learned latent dynamics model is used during the search procedure. Our approach instead learns and infers an action mask for each state, which can be applied in the search procedure. This action mask not only delineates illegal actions but also identifies redundant action variables.

\paragrapht{Benefits of leveraging CSIs over policy network without action abstraction.}
Consider a CMAB environment with a factored action space $A = A^1 \times A^2$, where $A^1$ and $A^2$ each take values in $\{0, 1\}$. In our CMAB environment, the next state depends exclusively on a single action variable. Consider a state $s_1$ in which any action with $A^1=1$ transitions into a rewarding state $s_2$, whereas all other actions remain in the non-rewarding state $s_1$. The actions $(1, 0)$ and $(1, 1)$ are then equally optimal. Hence, the policy in standard MCTS methods such as MuZero would spread its selections over both actions, resulting in fewer visits and less accurate statistics for each action. Action abstraction instead treats the two actions as one, so their visits and statistics are pooled.
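
\noindent As a back-of-the-envelope illustration of this effect, suppose the search allocates $N$ simulations at the root and, as an idealization, splits them evenly among $k$ equally optimal actions (the symbols $N$ and $k$ are introduced here only for illustration). Then
\[
\text{visits per action without abstraction} \approx \frac{N}{k},
\qquad
\text{visits of the merged abstract action} \approx N.
\]
In the two-variable example above, $k=2$. In our CMAB environment with $|\mathcal{A}| = 7^{3} = 343$ and a single relevant variable with 7 values, each abstract action groups $343 / 7 = 49$ concrete actions.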


\paragrapht{Comparison to CAMPs.}
CAMPs \citep{chitnis2021camps} consider a goal as a context and introduce a context-specific abstraction for each goal in preparation for planning. This approach cannot exploit CSIs at the level of individual states. For instance, the method proposed in CAMPs \textit{does not induce any action abstraction} in DoorKey environments since all action variables are required to reach the designated goal. Our work, on the other hand, infers context-specific independence on-the-fly and leverages action abstraction at every individual state.

\paragrapht{Computational overhead of training the auxiliary network.}
In terms of computational efficiency, our implementation of the method demonstrates performance comparable to MuZero, with each gradient step requiring approximately the same time. While this may differ depending on the implementation, the proposed method for learning CSI relationships end-to-end only adds an \csi{} coupled with a sparsity loss. We merely use a residual block and fully connected layers, as detailed in \Cref{appendix:implementation-ours}. Notably, the architecture parallels that of the policy and value prediction networks, and the initial parts of the network are shared with them.
