\begin{abstract}
The difficulty of deterministic planning increases exponentially with search-tree depth.
Black-box planning presents an even greater challenge, since planners must operate without an explicit model of the domain.
Heuristics can make search more efficient, but goal-aware heuristics for black-box planning usually rely on goal counting, which is often quite uninformative.
In this work, we show how to overcome this limitation by discovering macro-actions that make the goal-count heuristic more accurate.
Our approach searches for macro-actions with focused effects (i.e. macros that modify only a small number of state variables), which align well with the assumptions made by the goal-count heuristic.
Focused macros dramatically improve black-box planning efficiency across a wide range of planning domains, sometimes beating even state-of-the-art planners with access to a full domain model.
\end{abstract}

\section{Introduction}

In classical planning, an agent must select a sequence of deterministic, durationless actions to transition from a known initial state to a state that satisfies the desired goal condition.
Planning assumes the agent has access to a model of the effects of its actions, which it uses to reason about potential plans.
Usually this model takes the form of a PDDL description or finite-domain representation \cite{fox2003pddl2,helmert2009concise}, which specifies the preconditions and effects of each action.
However, in black-box planning \cite{lipovetzky2015classical,jinnai2017learning}, the model is instead defined implicitly by a simulator that the agent can query to generate state transitions.

In general, planning is hard: determining whether a plan exists to reach the goal is \textit{PSPACE}-complete \cite{bylander1994computational}.
Heuristic search eases this computational burden by guiding the search towards promising solutions.
Of course, heuristic search is only useful with a good heuristic.
In classical planning, much work has gone into the development of domain-independent methods that automatically construct heuristics to exploit as much problem structure as possible from the formal PDDL problem description \cite{bonet2001planning,hoffmann2001ff,helmert2006fast,Helmert2009Landmarks,Helmert2014MergeandShrink,Pommerening2015FromNonNegative,Keyder2014ImprovingDelete,Domshlak2015RedBlack}.
However, black-box planners have no formal domain description to exploit, and are therefore limited to less-informed heuristics.
%This poses a problem because an informative heuristic is especially important for black-box planning, where querying the simulator can be computationally expensive.

One simple, domain-independent heuristic that is compatible with simulator-based planners is the goal-count heuristic \cite{fikes1971strips}, which counts the number of state variables that differ between a given state and the goal.
The two basic assumptions of the goal-count heuristic are: a factored state space (i.e. there are state variables to count), and a known goal condition (i.e. there is a reason to modify variables).
A third, more subtle assumption is that the problem can be decomposed into subproblems, where each state variable can be treated as an approximately independent subgoal.
Unfortunately, this subgoal independence assumption is invalid for most planning problems of practical interest, and thus the goal-count heuristic is often misleading.

A second domain-independent strategy for improving planning efficiency is to use abstraction in the form of high-level macro-actions.
When macro-actions are added to the set of low-level actions, they can reduce search tree depth at the expense of increasing the branching factor.
In some cases, this has been shown to improve planning efficiency, particularly when the macro-actions cause the problem's subgoals to become independent \cite{korf1985macro}.
We further explore this idea in the context of black-box planning by constructing macros that are well aligned with the goal-count heuristic.

We begin by examining why goal counting becomes uninformative for certain sets of actions.
We show that both goal-count accuracy and planning efficiency are linked to how many state variables actions can modify at once.
Our investigation suggests a compelling strategy for improving the usefulness of the goal-count heuristic: learning \textit{focused} macro-actions that modify as few variables as possible, so as to align with the assumptions made by the goal-count heuristic.
This approach also seems well aligned with human problem solving: expert Rubik's cube solvers, for example, rely on focused macros for their most efficient planning strategies.

We describe a method for discovering focused macro-actions and test it on several classical planning benchmarks, restricting our attention to quickly finding feasible plans, rather than optimal ones, with the goal of minimizing the number of simulator queries.
Our learned macro-actions enable reliable and efficient planning, making dramatically fewer calls to the simulator and improving solve rate on most domains.
Our approach is designed to improve the goal-count heuristic, but it is also compatible with more sophisticated black-box planning techniques, where it yields similar improvements.
In some cases, black-box planning with focused macros is even competitive with approaches that have access to much more detailed problem information.

\section{Background}
% \subsection{Black-Box Planning}
We consider the problem of black-box planning \cite{jinnai2017learning}, where the planning agent does not have access to a declarative action description.
Formally, we define a black-box planning domain using the following quantities:
\begin{itemize}
    \item A set of states $S$, where each state is represented as a vector $v$ and each element $v_i$ is a state variable assignment from some finite domain $D(v_i)$;
    \item An action applicability function $A(s)$ that outputs the set of valid grounded actions for the given state $s \in S$;
    \item A deterministic\footnote{In general, black-box planning can include probabilistic effects, but we leave this more general case for future work.} simulator function $\textrm{Sim}(s,a)$, which the agent can query to determine the next state $s'$ after executing the action $a \in A(s)$ from state $s \in S$.
\end{itemize}
Each of the above quantities is fixed for all planning problem instances in the domain. A particular problem instance additionally contains:
\begin{itemize}
    \item A start state, $s_0 \in S$;
    \item A goal condition $G$, represented as a list of variable assignments to all (or some subset) of the state variables.
    % \footnote{Single-state goals can be expressed as conjunctions with one literal per state variable.}
\end{itemize}

\noindent The planner's objective is to find a plan, i.e. a sequence of actions, that connects state $s_0$ to some state $s_G$ that satisfies $G$.
In general, actions can have associated costs, and an optimal plan is one that minimizes the sum of its action costs.
Here we are concerned with planning efficiency, so we focus on \textit{satisficing} solutions---that is, finding a plan as quickly as possible, regardless of cost.
We measure planning efficiency in terms of the number of simulator queries (equivalently, the number of generated states) before finding a plan.

\subsection{The Goal-Count Heuristic}
The goal-count heuristic, $\#g$, is defined in terms of the problem-specific goal, $G$.
For any goal condition $G$, $\#g(s)$ counts the number of variables in state $s$ whose values differ from those specified in $G$, with $\#g(s)=0$ if and only if $s$ satisfies $G$.
A well-known downside of the goal-count heuristic is its dependence on the size of $G$.
In the extreme case where $|G|=1$, the goal-count heuristic only separates goal states from non-goal ones.
Nevertheless, due to the relative lack of information in black-box planning, the goal-count is often the only goal-aware heuristic available.
Other planners may add additional components, such as state novelty \cite{frances2017purely}, but the black-box versions of those planners still rely on goal counting at their core.
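
To make the definition concrete, the goal-count can be written as a sum of indicator terms. In the sketch below, we assume (as notation for illustration only) that $G$ is a set of pairs $(i, g_i)$, each assigning the goal value $g_i$ to state variable $i$, and that $s_i$ denotes the value of variable $i$ in state $s$:
\[
\#g(s) = \sum_{(i,\,g_i) \in G} \mathbf{1}[s_i \neq g_i].
\]
For example, with $s = (1, 0, 2)$ and a goal $G$ that assigns $(1, 1, 0)$ to the three variables, two variables differ and $\#g(s) = 2$. When $|G| = 1$ the sum has a single term, so $\#g(s) \in \{0, 1\}$, which is the degenerate case noted above.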

\subsection{Macro-Actions}
A \textit{macro-action} (or \textit{macro}) is a deterministic sequence of actions,\footnote{For simulators with probabilistic effects, macro-actions could in principle be generalized to more complex abstract skills incorporating state information, but that extension is beyond the scope of this work.} typically for the purpose of accomplishing some useful subgoal.
To avoid confusion, we often refer to the original non-macro actions as ``primitive'' actions.
Macros have parameters, preconditions, and effects, just like primitive actions, but in black-box planning, we again assume that the planner does not have access to such a declarative description.
Instead, when planning with macro-actions, there are two alternatives.
Either the action applicability function $A(s)$ and simulator function $\textrm{Sim}(s,a)$ are updated to additionally compute \emph{macro-action} validity and effects in a single step, or alternatively, each primitive action in the macro must be simulated sequentially, with longer macros requiring more simulator queries.

\section{Effect Size and Goal-Count Accuracy}
The goal-count heuristic implicitly treats each state variable as an independent subgoal.
There are two ways to satisfy this assumption exactly.
The first is if each subgoal can be achieved in one step without modifying any other state variable.
The second, more general way, explored by \textcite{korf1985macro}, is if each subgoal can be achieved in one step (possibly modifying other state variables) and the subgoals are serializable---i.e. there is an ordering of the state variables that retains previously-solved subgoals when solving new ones.

In general, an action can of course change many state variables, and the problem representation may not allow the subgoals to be serialized---both of which can cause the goal-count heuristic to be uninformative.
However, for a heuristic to be useful, it does not need to be perfect; it simply needs to be \textit{rank correlated} with the distance to the goal: higher true distances should correspond to higher heuristic values \cite{wilt2015building}.
When the heuristic is perfectly rank correlated, there is a monotonic relationship between heuristic and true cost, and best-first search will always expand nodes in order of their true distance from the goal.

We hypothesize that if each action modifies only a small number of state variables, the problem will better match the assumptions of the goal-count heuristic, and thus the heuristic and true goal distance will be more positively rank correlated.
We informally say such actions have ``focused'' effects, and we formalize this idea with the following definitions:

\begin{definition} \emph{The \emph{effect size of an action} is the maximum number of state variables whose values change by executing the action, over all states where the action is applicable.}
\end{definition}

\begin{definition} \emph{The \emph{effect size of a macro-action} is the maximum number of state variables, measured at the end of macro-action execution, that are different from their starting values, over all states where the macro-action is applicable, even if additional variables were modified during execution.}
\end{definition}
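
In symbols, one way to write these two definitions is as follows. The notation $e(\cdot)$, $\Delta(\cdot,\cdot)$, and the extension of $\textrm{Sim}$ to action sequences are introduced here for illustration only. Let $\Delta(s, s') = |\{ i : s_i \neq s'_i \}|$ count the variables that differ between two states. Then, for a primitive action $a$,
\[
e(a) = \max_{s \in S \,:\, a \in A(s)} \Delta\bigl(s, \textrm{Sim}(s, a)\bigr),
\]
and for a macro-action $m = [a_1, \ldots, a_n]$, writing $\textrm{Sim}(s, m)$ for the state reached by applying $a_1, \ldots, a_n$ in sequence from $s$,
\[
e(m) = \max_{s \in S \,:\, m \text{ applicable in } s} \Delta\bigl(s, \textrm{Sim}(s, m)\bigr).
\]
Only the start and end states enter $e(m)$, so a macro that changes a variable and later restores it does not count that variable toward its effect size.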

If our hypothesis above is correct, we expect the goal-count heuristic to be more accurate for domains where actions have smaller effect size, and we further expect this to lead to an improvement in planning efficiency.
In the following experiment, we see better rank correlation between heuristic and true distance for domains whose actions have low average effect size, and we see that this leads to an approximately exponential improvement in planning efficiency.

\subsection{The Suitcase Lock Domain}
To study the relationship between effect size and planning efficiency, we introduce the Suitcase Lock domain.
The Suitcase Lock is a planning problem whose solution requires entering a combination on a lock with $N$ dials, each with $M$ digits. The lock has $2N$ actions: half of them each increment a fixed subset of the dials (modulo $M$), and the other half decrement those same subsets (see Figure \ref{fig:suitcase-results}, top).
For each problem instance, a start state, goal state, and (fixed) action set are generated randomly, and a parameter $\bar{k}$ controls the mean effect size across all actions.
This allows us to examine action effect size while holding other problem variables constant.
For implementation details, see Appendix \appref{appendix:suitcase-impl}{A} of the supplementary materials.

% Suitcase Lock planning time
\begin{figure}[t]
\centering
\begin{subfigure}{.16\textwidth}
  \centering
  \includegraphics[width=\textwidth]{images/suitcase-lock}
  \caption*{}
\end{subfigure}%
\par
\begin{subfigure}{.23\textwidth}
  \centering
  \includegraphics[width=.95\textwidth]{images/suitcase-binary-cb}
  \caption*{$N=20, M=2$}
\end{subfigure}%
\begin{subfigure}{.23\textwidth}
  \centering
  \includegraphics[width=.95\textwidth]{images/suitcase-4ary-cb}
  \caption*{$N=10, M=4$}
\end{subfigure}
\caption{(Top) The Suitcase Lock domain with $N=4$; (Bottom) Generated states vs. effect size for Suitcase Lock.}
\label{fig:suitcase-results}
\end{figure}

\subsubsection{Focused Actions Improve the Goal-Count Heuristic}
We first investigate the accuracy of the goal-count heuristic for two small Suitcase Lock problems. %: one with $N=10$ and $M=2$; and the other with $N=5$ and $M=4$.
For each possible effect size, we compute the true distance between all pairs of states, and compare the results with the goal-count, treating the second state of each pair as the goal. % $\bar{k} \in \{1,...,N-1\}$
We compute the average heuristic value for each true distance, and then compute the Pearson correlation and Spearman rank correlation coefficients between heuristic and distance. % across 10 random seeds for each $\bar{k}$
The results are shown in Table \ref{tab:correlation}, where we see that actions with more focused effects (i.e. lower $\bar{k}$) lead to significantly higher correlation.
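
For reference, the two coefficients have the following standard definitions, which we restate here under the assumption that they are computed over $n$ paired samples $(d_j, h_j)$ of true distance and heuristic value (the symbols $d_j$, $h_j$, and $n$ are our own notation):
\[
\rho_P = \frac{\sum_{j=1}^{n} (d_j - \bar{d})(h_j - \bar{h})}{\sqrt{\sum_{j=1}^{n} (d_j - \bar{d})^2}\,\sqrt{\sum_{j=1}^{n} (h_j - \bar{h})^2}},
\]
where $\bar{d}$ and $\bar{h}$ are the sample means. The Spearman coefficient $\rho_S$ is the same expression evaluated on the ranks of $d_j$ and $h_j$ rather than on their raw values, so $\rho_S = 1$ whenever the heuristic is a strictly increasing function of true distance, even if that function is nonlinear.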



\subsubsection{Focused Actions Improve Planning Efficiency}
Next, we run two planning experiments using the goal-count heuristic and greedy best-first search (GBFS). To evaluate planning efficiency, we measure the number of generated states needed to solve each instance, since we care about feasible plans, rather than optimal ones.
% We run two experiments %: first with $N=20$ and $M=2$, and second with $N=10$ and $M=4$, with $\bar{k}$ varying in the range $\{1,...,N-1\}$.
% Since we care about feasible plans, rather than optimal ones we evaluate performance using the number of generated states.% across $100$ different random seeds

% Correlation table
\begin{table}[t]
    \centering
    \begin{tabular}{crrrr}
    \toprule
    Effect Size  & \multicolumn{2}{c}{\textbf{N=10, M=2}} & \multicolumn{2}{c}{\textbf{N=5, M=4}}\\
    $\bar{k}$ & $\rho_P$ & $\rho_S$ & $\rho_P$ & $\rho_S$ \\
    \midrule
    1 & 1.000 &  1.000 & 0.775 &  0.760 \\
    2 & 0.200 &  0.179 & 0.263 &  0.226 \\
    3 & 0.110 &  0.092 & 0.046 &  0.018 \\
    4 & 0.060 &  0.041 & 0.000 & -0.044 \\
    5 & 0.020 &  0.013 & -- & -- \\
    6 & 0.000 & -0.007 & -- & -- \\
    7 & 0.000 &  0.001 & -- & -- \\
    8 & 0.000 & -0.001 & -- & -- \\
    9 & 0.000 &  0.005 & -- & -- \\
    \bottomrule
    \end{tabular}
    \caption{Correlation results between the goal-count heuristic and true distance for Suitcase Lock.
    Actions with smaller effect size ($\bar{k}$) lead to significantly higher Pearson's correlation ($\rho_P$) and Spearman's rank correlation ($\rho_S$) coefficients.}
    \label{tab:correlation}
\end{table}

Figure \ref{fig:suitcase-results} (bottom) shows an approximately exponential relationship between effect size and planning time.
When $M=2$ and $\bar{k}=1$, the goal-count heuristic is exactly equal to the true distance to the goal, and GBFS can generate at most ${N}^2$ states before finding the goal.
By contrast, when $\bar{k}=(N-1)$ the heuristic is maximally uninformative, and GBFS may generate $N\cdot 2^{N}$ states in the worst case, since it cannot expand any nodes with heuristic value $j$ until it has expanded all nodes with heuristic value $< j$.
This exponential trend appears to hold even when state variables are not binary.
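
To give a sense of scale, taking the two bounds above at face value for the binary experiment in Figure \ref{fig:suitcase-results} ($N=20$, $M=2$):
\[
N^2 = 20^2 = 400, \qquad N \cdot 2^N = 20 \cdot 1{,}048{,}576 = 20{,}971{,}520,
\]
a gap of more than four orders of magnitude between the most focused and least focused action sets.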

These results on the Suitcase Lock domain suggest that reducing effect size is a viable strategy for improving planning efficiency.
To further investigate this idea, we propose a method for learning macro-actions with low effect size.

\section{Learning Macros with Focused Effects}
\label{sec:macro-learning}
We search for macro-actions using best-first search (BFS) with a simulation budget of $B_M$ state transitions.
We start the search at a randomly generated state, and the search heuristic is macro-action effect size---or infinity if the macro-action modifies zero variables---plus the number of primitive actions in the macro.
Technically, we would need to evaluate each macro from every valid state to determine its effect size, which is clearly infeasible.
In practice, we simply measure each macro's effect size once and assume it does not change (although we could easily relax this assumption by running the macro from multiple states).

We save the $N_M$ macro-actions with the lowest effect size, and ignore duplicate macro-actions that have the same net effect.
To encourage diversity of macros, we can optionally repeat the search up to $R_M$ times, each time from a new random starting state in which none of the previously saved macro-actions is valid, stopping early if we fail to find such a state.\footnote{
Finding such a state may be as hard as planning, unless the simulator can be reset to generate new starting states; however, in practice, a random walk is often sufficient (see appx. Note \appref{note:finding-macro-start-states}{B.2.1}).
}
This helps us find macros that apply in most situations, even when there are constraining preconditions.

\begin{algorithm}[!t]
\caption{Learn macro-actions with focused effects}
\label{alg:algorithm}
\textbf{Input}: Starting state $s_0$, number of macro-actions $N_M$, number of repetitions $R_M$, search budget $B_M$

% \textbf{Parameter}: Optional list of parameters\\
\textbf{Output}: List of macro-actions $L_M$

\begin{algorithmic}[1] %[1] enables line numbers
\STATE \textbf{Define} $g(m) := \mathrm{length}(m) $\\
\STATE \textbf{Define} $h(s) := \left\{
	\begin{array}{ll}
	    |\mathrm{net\_effects}(s-s_0)| & \text{if } |\mathrm{net\_effects}(s-s_0)| > 0,\\
		\infty  & \text{otherwise}
	\end{array}
\right.$
\STATE \textbf{Define} $f(s,m) := g(m) + h(s)$\\
where $m$ is the macro (i.e. action sequence) from $s_0$ to $s$\\
\hspace{10px}
\STATE Let $L_M$ be an empty list of macro-actions
\STATE Let $Q$ be a (max) priority queue of size $N_M/R_M$
\FOR{repetition $r$ in $\{1,...,R_M\}$}
\STATE Run best-first search (BFS) from $s_0$ with budget $B_M/R_M$, minimizing heuristic $f(s,m)$
\FOR{each state $s_i$ and macro $m_i$ visited by BFS}
\STATE Store $m_i$ in $Q$, with priority $h(s_i)$\\
// When $Q$ becomes full, the action sequences\\
// with largest $h$-score will get evicted first
\ENDFOR
\STATE Add each unique macro in $Q$ to $L_M$
\STATE Clear $Q$
\STATE $s_0 \gets$ new random state, such that none of the macros in $L_M$ can run
\IF {$s_0$ is None}
\STATE \textbf{break}
\ENDIF
\ENDFOR
\STATE \textbf{return} $L_M$
\end{algorithmic}
\end{algorithm}

The pseudocode for this procedure is in Algorithm \ref{alg:algorithm}.
We use $m_i$ to denote the macro whose action sequence generated state $s_i$.
Consider an example with two primitive actions $a_1$ and $a_2$, where BFS starts at state $s_0$. Expanding $s_0$, the action $a_1$ generates $s_1$, and $a_2$ generates $s_2$. Expanding $s_1$, $a_1$ generates $s_3$, and $a_2$ generates $s_4$. Thus macro $m_4$, corresponding to state $s_4$, would be the action-sequence $[a_1, a_2]$, and we would evaluate its net effect by comparing $s_4$ with $s_0$.
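
To illustrate the scoring, suppose (hypothetically) that $s_4$ differs from $s_0$ in exactly three state variables. Using the definitions in Algorithm \ref{alg:algorithm},
\[
g(m_4) = \mathrm{length}([a_1, a_2]) = 2, \qquad h(s_4) = 3, \qquad f(s_4, m_4) = 2 + 3 = 5.
\]
If instead $a_2$ exactly undid $a_1$, so that $s_4 = s_0$, the net effect would be empty and $h(s_4) = \infty$, so $m_4$ would never be preferred over a macro that changes at least one variable.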

% For a planning problem with $|S|$ states and $|T|$ valid state transitions, this algorithm has time complexity $O(\min (B_M, |S|+|T|))$, and space complexity $O(N_M + \min(B_M, |S|))$.

\section{Experiments}
\label{sec:experiments}

\begin{table*}[t]
\centering
\begin{tabular}{lrrrrrrrrrrrr}
\toprule
       &    &    &  \multicolumn{2}{c}{GBFS(A)}  &   \multicolumn{2}{c}{GBFS(A+M)}  &   \multicolumn{2}{c}{BFWS(A)}    &    \multicolumn{2}{c}{BFWS(A+M)}   &    \multicolumn{2}{c}{LAMA(A)} \\
\cmidrule(lr){4-7}\cmidrule(lr){8-11}\cmidrule(lr){12-13}
Domain        &  $N_M$  &  $B_M$  &  Gen & Sol  &  Gen & Sol  &   Gen & Sol    &    Gen & Sol   &    Gen & Sol \\

\midrule
Depot    & 8 & 50K  & 58275.9 & \textbf{0.74} & \textbf{55132.4} & 0.60          & 75966.9 & \textbf{0.48} & \textbf{72205.8} & 0.34 &  46620.9 & 1.00 \\
Doors    & 8 & 5K   &  3050.7 & 1.00          &   \textbf{512.6} & 1.00          &  4660.9 & 1.00 & \textbf{3057.3} & 1.00 &    293.0 & 1.00 \\
Ferry    & 8 & 5K   &  1875.8 & 1.00          &  \textbf{1151.4} & 1.00          &  1209.9 & 1.00 & \textbf{1163.5} & 1.00 &    699.8 & 1.00 \\
Gripper  & 8 & 5K   &  7314.8 & 1.00          &  \textbf{6277.0} & 1.00          & 44945.9 & 1.00 & \textbf{6295.9} & 1.00 &   6493.1 & 1.00 \\
Hanoi    & 8 & 100K & 78433.6 & 0.78          &  \textbf{6358.8} & \textbf{1.00} & 63455.2 & 1.00 & \textbf{3365.9} & 1.00 &  65496.4 & 1.00 \\
Miconic  & 8 & 5K   &  7559.4 & 1.00          &  \textbf{1907.1} & 1.00          & 10269.2 & 1.00 & \textbf{1884.3} & 1.00 &   1316.7 & 1.00 \\
\midrule
15-Puz.  &      192  &            32K  &  30840.5 & 1.00  & \textbf{4952.4} & 1.00  &  109425.2 & 1.00  &    \textbf{6290.1} & 1.00  &         -- & -- \\
Rubik's    &       576  &            1M  &       $>$2M & 0.00  &   \textbf{171.3K} & \textbf{1.00}  &       $>$2M & 0.00  &  \textbf{163.8K} & \textbf{1.00}  &  9.13M & 1.00 \\
\bottomrule
% \\
% \multicolumn{13}{c}{(A) - primitive actions only; (A+M) - primitive actions + focused macros}\\
% \multicolumn{13}{c}{$N_M$ - number of macros; $B_M$ - macro-learning budget; Gen - generated states; Sol - solve rate.}
\end{tabular}
\caption{Black-box planning results for PDDLGym-based simulators (top), and domain-specific simulators (bottom).
% Adding focused macros improves planning efficiency and solve rate.
(A) - primitive actions only; \mbox{(A+M) - primitive actions} + focused macros; $N_M$ - number of macros; $B_M$ - macro-learning budget; Gen - generated states; Sol - solve rate; (bold) - best performance of each planner. The efficiency of both GBFS and BFWS($R^*_G$) is improved by adding focused macros. Note that LAMA is an informed planner with access to much more information than black-box planners, and is only included for reference.
}
\label{tab:results-all}
\end{table*}

We evaluate our method by learning macro-actions in a variety of black-box planning domains and subsequently using them for planning.
% Code is included in supplementary materials.
% \footnote{Code included in supplementary materials.}
% \footnote{Code at \url{https://github.com/camall3n/skills-for-planning}}
We use PDDLGym \cite{silver2020pddlgym} to automatically construct black-box simulators from classical PDDL planning problems.
Additionally, we use two domain-specific simulators (for 15-puzzle and Rubik's cube) that have a different state representation to show the generality of our approach.
See the appendix for implementation details and a discussion of how we selected the various macro-learning hyperparameters (sections \appref{appendix:simulator-details}{B} and \appref{appendix:reproducibility}{E}, respectively).

We select the domains to give a representative picture of how the method performs on various types of planning problems.
For PDDLGym compatibility reasons, we restrict the domains to those requiring only \texttt{strips} and \texttt{typing}.
For the domain-specific simulators, we select 15-puzzle and Rubik's cube in particular, because they present opposing challenges for our macro-learning approach.
In 15-puzzle, primitive actions have very focused effects (each modifies only the blank space and one numbered tile), but naively chosen macro-actions tend to have much larger effect sizes, and both primitive actions and macros have state-dependent preconditions.
In Rubik's cube, actions and macros have no preconditions, but primitive actions are highly non-focused (each modifies 20 of the simulator's 48 state variables) and the state space is so large
($\sim\!4.3\!\times\!10^{19}$ unique states \cite{rokicki2014gods})
that black-box planning with primitive actions alone is unable to solve the problem efficiently.

\subsection{Methodology}
For each planning domain, we generate 100 problem instances with unique random starting states and a fixed goal condition.\footnote{All problem instances are included in the code repository.} %\footnote{We include all problem instances in the code repository for reproducing our results.}
All problem instances share the same state space, and the planner has access to the simulator function, the action applicability function, a vector of state information, and the goal condition.
We emphasize that although the PDDLGym domains are specified using PDDL, the planner never sees the PDDL during either macro search or planning.

We learn focused macro-actions as described in Sec. \ref{sec:macro-learning} and add them to the set of primitive actions, which ensures that the same set of states can still be reached.
These macros are then used to update the simulator and action applicability function, allowing the learned macros to execute in a single step for improved computational efficiency.\footnote{
See Appendix \appref{appendix:updating-sim}{C} for details on how we update the simulator and action applicability function to incorporate the learned macros.
}
Note that updating the simulator in this way does not reduce search effort, only time.
Even if the primitive actions in a macro were simulated one-by-one, the intermediate states are neither stored nor explored, and hence do not count towards the number of generated states.

The macros are learned once, for the first problem instance, and then reused on all remaining problem instances for that domain.
In general, it can be challenging to incorporate macros into any planning algorithm, since one must weigh their search benefits against the increased branching factor.
For simplicity, our experiments fixed the number of macros $N_M$ (see Table \ref{tab:results-all}), but in principle $N_M$ could be chosen automatically based on which macros reduce the problem's average effect size.

To solve each planning problem, we use greedy best-first search (GBFS) with the goal-count heuristic and compare performance with the additional learned macro-actions versus with primitive actions alone.
We measure planning efficiency as the number of simulator queries that the planner makes before finding a plan.
This choice of performance metric is the most natural fit for black-box planning, and it allows for fair comparisons of algorithms across different implementation languages and hardware configurations.
%Because we customize the simulator to handle the learned macros, each macro application results in just a single generated state.

In Table \ref{tab:results-all}, we show the average solve rate and number of generated states (i.e. simulator queries) for each domain.
Since we only pay the macro-learning cost $B_M$ for the first problem instance, we can in principle amortize this cost over the total number of problem instances. (Note that the $B_M$ values reported in the table are non-amortized and are separate from the number of generated states.) Except in the case of Depot, we see that planning with focused macros matches or increases solve rate and improves planning efficiency, by up to an order of magnitude, versus planning with primitive actions alone.
In Rubik's cube, focused macros still perform better, even if we account for the \textit{entire} macro-learning budget.
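
As a rough check using the Rubik's cube row of Table \ref{tab:results-all}, charging the full macro-learning budget to planning with GBFS(A+M) gives
\[
B_M + \mathrm{Gen} \approx 1\mathrm{M} + 171.3\mathrm{K} \approx 1.17\mathrm{M} < 2\mathrm{M},
\]
which is still below the more than 2M states that the primitive-action baseline generates without solving any instance. Amortizing $B_M$ over the 100 problem instances instead would add only $1\mathrm{M}/100 = 10\mathrm{K}$ states per instance.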

\subsection{Comparisons with Other Planners}
In addition to greedy best-first search (GBFS) with the goal-count heuristic, we also evaluate our method in conjunction with Best-First Width Search, or BFWS \cite{lipovetzky2017best}, a family of search algorithms that augment their search heuristic with a novelty metric computed using Iterated Width (IW) search \cite{lipovetzky2012width}.

We specifically use the best-performing black-box planning version of BFWS: BFWS($R^*_G$) \cite{frances2017purely}.
This version starts by running IW up to two times, with increasing precision, to generate a set $R^*_G$ of goal-relevant atoms.
During search, each state $s$ is evaluated based on how many relevant atoms were satisfied at some point along the path to $s$.
This forms a relevance count $\#r(s)$, which is combined with the goal-count $\#g(s)$ to compute the novelty width metric $w_{\#r,\#g}$.
The algorithm runs GBFS using heuristic $(w, \#g, c)$, evaluating nodes first by width, breaking ties with $\#g$, and then breaking further ties with $c$, the cost to reach the node.

We ran BFWS on each domain and measured its planning efficiency (see Table \ref{tab:results-all}).
We followed \textcite{lipovetzky2017best} and limited the width precision to $w \in \{1, \textrm{\textgreater}1\}$ on Depot and Rubik's cube to save computational resources.

Again we find that focused macros substantially improve planning efficiency, likely because the heuristic still uses goal counting at its core.
Surprisingly, we found that BFWS did not perform significantly better than the primitive-action GBFS baseline.
In fact, comparing against GBFS, we observe that focused macros alone are more beneficial for planning than the more sophisticated novelty-based heuristic.


As a point of reference, we also compared against LAMA \cite{richter2010lama}, which has full access to a declarative representation of the problem, information far beyond what is available to black-box planners.
We ran the first iteration of LAMA on the same problems we used with PDDLGym, as well as a SAS$^{+}$ representation of the Rubik's cube, adapted from \textcite{buchner2018abstraction}.
% \footnote{On a different PDDL version of Rubik's cube, LAMA failed to complete the translation step before running out of memory (16GB).}
On a different PDDL version of Rubik's cube, LAMA failed to complete the translation step before running out of memory (16GB).
We find that our method is competitive with LAMA across the majority of domains, despite the fact that LAMA has access to more information.
On the 100 hardest Rubik's cube problems from \textcite{buchner2018abstraction}, which neither primitive-action baseline can solve, LAMA generates 9.1 million states on average, whereas our approach generates only 171 thousand.

\begin{table}[b!]
    \def\arraystretch{1.15}%1 is the default, change whatever you need
    \centering
    \begin{tabular}{llrrr}
    \toprule
     & & Generated & \multicolumn{1}{c}{Remaining} & Solve \\
     & & States & \multicolumn{1}{c}{Errors ($\#g$)} & Rate \\
     \midrule
     \multirow{3}{*}{\rotatebox[origin=c]{90}{ {15-Puz.}}}&
    Primitives only    &  30840.5 & 0.0 & 1.0 \\
    &Random macros      & 72542.3 & 0.0 & 1.0 \\
    &Focused macros     & \textbf{4952.4} & 0.0 & 1.0 \\
    \midrule
    \midrule
    \multirow{4}{*}{\rotatebox[origin=c]{90}{{Rubik's}}} &
    Primitives only & $>$2M             & 11.8 &  0.0 \\
    & Random macros   & $>$2M             & 16.4 &  0.0 \\
    & Focused macros  & \textbf{171331.4} & \textbf{0.0} & \textbf{1.0} \\
	\cline{2-5}
    & Expert macros   &  30229.1  & 0.0 & 1.0 \\
    \bottomrule
    \end{tabular}
    \caption{Planning results for 15-puzzle and Rubik's cube comparing different action spaces. Random macros perform significantly worse than both primitive actions and focused macros. Trials with macros also include the primitive actions.}
    \label{tab:results-by-macro-type}
\end{table}

\begin{figure*}[ht!]
\centering
    \begin{minipage}[b]{.32\textwidth}
    \includegraphics[width=0.99\textwidth]{images/npuzzle-planning-boxes-small-cb}
    \subcaption{}\label{fig:npuzzle-planning}
    \end{minipage}\quad
    \begin{minipage}[b]{.32\textwidth}
    \includegraphics[width=0.99\textwidth]{images/cube-planning-lines-small-cb}
    \subcaption{}\label{fig:cube-planning}
    \end{minipage}\quad
    \begin{minipage}[b]{.32\textwidth}
    \includegraphics[width=0.99\textwidth]{images/cube-entanglement-small-cb}
    \subcaption{}\label{fig:cube-entanglement}
    \end{minipage}
    \caption{(a) 15-puzzle planning efficiency by macro type. Adding focused macros leads to a significant performance improvement over primitive actions alone. Random macros have the opposite effect. (b) Rubik's cube planning performance by macro type. The vertical axis represents the best observed goal-count value for the number of generated states on the horizontal axis. (c) Effect size vs. length of Rubik's cube macro-actions, by type. (Some points overlap.)}
\end{figure*}

\subsection{Comparison with Random Macros}
One might wonder whether the improvements in planning efficiency are due to the macros' focused effects, or simply the fact that we are using macros at all.
To isolate the source of the improvement, we conducted a second experiment using 15-puzzle and Rubik's cube.
Here we compared the focused macro-actions against an equal number of ``random'' macro-actions of the same length, which were generated (for each random seed) by selecting actions uniformly at random from the valid actions at each state.
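
The sampling procedure for random macros can be sketched as follows. This is a minimal illustration rather than our actual implementation: the simulator interface (\texttt{valid\_actions}, \texttt{step}), the toy one-row sliding domain, and the helper \texttt{random\_macros\_like} are all hypothetical names introduced for the example.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple

State = Tuple[int, ...]
Action = Hashable


def random_macro(
    start: State,
    length: int,
    valid_actions: Callable[[State], Sequence[Action]],
    step: Callable[[State, Action], State],
    rng: random.Random,
) -> List[Action]:
    """Sample a macro by choosing uniformly among the valid actions at each visited state."""
    macro: List[Action] = []
    state = start
    for _ in range(length):
        action = rng.choice(list(valid_actions(state)))
        macro.append(action)
        state = step(state, action)  # advance so validity is re-evaluated at the new state
    return macro


# Toy stand-in for a black-box simulator (an assumption for illustration only):
# a blank cell (0) sliding left or right along a single row.
def toy_valid_actions(state: State) -> List[str]:
    blank = state.index(0)
    actions = []
    if blank > 0:
        actions.append("left")
    if blank < len(state) - 1:
        actions.append("right")
    return actions


def toy_step(state: State, action: str) -> State:
    blank = state.index(0)
    target = blank - 1 if action == "left" else blank + 1
    cells = list(state)
    cells[blank], cells[target] = cells[target], cells[blank]
    return tuple(cells)


def random_macros_like(focused_lengths: Sequence[int], start: State, seed: int) -> List[List[str]]:
    """Build one random macro per focused macro, matched in length, for a given seed."""
    rng = random.Random(seed)
    return [random_macro(start, n, toy_valid_actions, toy_step, rng) for n in focused_lengths]
```

Matching both the number and the lengths of the focused macros keeps the branching factor and macro depth comparable, so that any difference in planning cost is attributable to the macros' effects.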

We present the results in Table \ref{tab:results-by-macro-type}, as well as Figures \ref{fig:npuzzle-planning} and \ref{fig:cube-planning},
where we observe that random macros perform significantly worse than both the primitive actions and the learned focused macros.
In both domains, random macros also consistently had larger effect sizes than focused macros.
Figure \ref{fig:cube-entanglement} shows a visualization of Rubik's cube macro effect size versus macro length.
We suspect the higher planning cost of random macros is partly due to their increased effect size.
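
To make the notion of effect size concrete, the sketch below counts the state variables whose values differ before and after a macro is applied from a given state. It assumes states are flat tuples of variables; the \texttt{swap\_step} domain (swapping adjacent positions) is a hypothetical toy simulator, not one of our benchmark domains.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[int, ...]


def swap_step(state: State, action: int) -> State:
    """Toy primitive action (illustrative only): swap positions `action` and `action + 1`."""
    cells = list(state)
    cells[action], cells[action + 1] = cells[action + 1], cells[action]
    return tuple(cells)


def apply_macro(state: State, macro: Sequence[int], step: Callable[[State, int], State]) -> State:
    """Run the macro's primitive actions in order, starting from `state`."""
    for action in macro:
        state = step(state, action)
    return state


def effect_size(state: State, macro: Sequence[int], step: Callable[[State, int], State]) -> int:
    """Number of state variables whose value changes after running the macro."""
    result = apply_macro(state, macro, step)
    return sum(1 for before, after in zip(state, result) if before != after)


start = (1, 2, 3, 4)
print(effect_size(start, [0], swap_step))     # 2: one swap changes two variables
print(effect_size(start, [0, 1], swap_step))  # 3: the two swaps overlap on one position
print(effect_size(start, [0, 2], swap_step))  # 4: disjoint swaps change every variable
```

In this toy domain, longer action sequences chosen without regard to their net effect tend to touch more variables, which is the behavior we observe for random macros.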



\subsection{Examining Expert Macros in Rubik's Cube}
Expert human ``speedcubers'' use macro-actions to help them manage the Rubik's cube's highly non-focused actions.
In speedcubing, the goal is to solve the cube as quickly as possible, without necessarily finding an optimal plan.
Most speedcubers learn a collection of macro-actions (called ``algorithms'' in Rubik's cube parlance) and then employ a strategy for sequencing those macro-actions to solve the cube.
Expert macro-actions tend to affect only a small number of state variables, and proper sequencing enables speedcubers to preserve previously-solved parts of the cube while solving the remainder.
Common solution methods typically involve multiple levels of hierarchical subgoals and produce plans approximately twice as long as optimal.

As a benchmark, we consider a simplification of the most common expert strategy, where macros are composed of just primitive actions.
We select a set of six hand-coded, expert macro-actions to perform various complementary types of permutations.\footnote{
Macro-action sequences are included in Appendix \appref{appendix:expert-macros}{D}.
}
We visualize one of these macro-actions, which swaps three corner pieces, in Figure \ref{fig:cube-expert}.
Since our simulator uses a fixed cube orientation, we consider all 96 possible variations of each macro (to account for orientation, mirror-flips, and inverses), resulting in 576 total macros---the same number used for the random macro and focused macro trials.
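
One decomposition consistent with this count (an assumption on our part, since the factors are not stated explicitly above) is 24 rotational orientations of the cube, 2 mirror images, and 2 directions (forward and inverse):
\[
24 \times 2 \times 2 = 96, \qquad 6 \times 96 = 576.
\]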

In Figure \ref{fig:cube-entanglement}, we plot the effect size and length of each macro, labeled by macro type.
We can see that the focused macros have significantly smaller effect size than primitive actions or random macros, and begin to approach the effect size of the expert macro-actions.
We also note that the focused macros are somewhat shorter on average than the expert macros, and we suspect that increasing the search budget would result in learning macros with even smaller effects.

In Table \ref{tab:results-by-macro-type} and Figure \ref{fig:cube-planning}, we compare planning with the expert macros against the other macro types. Planning with the expert macros is the most efficient (30229.1 generated states on average), but the learned focused macros (171331.4 generated states) are not far behind, and both solve every problem.
By contrast, the random macros and primitive actions never solved the problem within the simulation budget.
We also found that the average solution length for focused and expert macro-actions was roughly five to six times longer than typical human speedsolve solutions (378 and 319 primitive actions, respectively, vs. \texttildelow$60$ \cite{speedsolving2021cfop}), which suggests that there are additional insights to be mined from human strategy beyond just learning focused macro-actions.

\subsection{Interpretability of Focused Macros}

\begin{figure}[b!]
    \centering
    \begin{subfigure}{.188\textwidth}
      \centering
      \includegraphics[width=.98\textwidth]{images/cube-expert}
      \caption{Expert 3-corner-swap}
      \label{fig:cube-expert}
    \end{subfigure}%
    \quad\quad
    \begin{subfigure}{.232\textwidth}
      \centering
      \includegraphics[width=.794\textwidth]{images/cube-learned}
      \caption{Learned 3-pair-swap}
      \label{fig:cube-learned}
    \end{subfigure}
    \caption{Expert and learned macro-actions (Rubik's cube).}
    \label{fig:cube-macros}
\end{figure}

We examined the learned focused macros for several domains and found that in addition to having low effect size, they were also frequently easy to interpret.
In 15-puzzle, one type of macro swapped the blank space with a central tile; another type exchanged three tiles without moving the blank space.
In Rubik's cube, one macro (Figure \ref{fig:cube-learned}) swapped three edge-corner pairs while keeping them connected.
In Tower of Hanoi, macros moved entire stacks of disks from one peg to another.
We remark that this is quite similar to the interpretability of the human expert macros in Rubik's cube.




\subsection{Generalizing to Novel Goal States}
Since our macro-generation step is goal-independent, we can reuse previously learned macros to solve problems with novel goal states.
To demonstrate this, we generate 100 random goal states for 15-puzzle and Rubik's cube and then solve the puzzles again.
In both domains, we find that planning efficiency and solve rate remain effectively unchanged for novel goal states (see Table \ref{tab:alt-goals}).
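
The reason reuse is possible can be illustrated with a small sketch: the goal appears only in the heuristic, never in the action set. The code below is a simplified greedy best-first search guided by goal count, not our experimental planner; the adjacent-swap domain, the action set \texttt{ACTIONS}, and the three-step macro \texttt{(0, 1, 0)} are hypothetical choices made for the example.

```python
import heapq
from itertools import count
from typing import List, Optional, Sequence, Tuple

State = Tuple[int, ...]
Macro = Tuple[int, ...]  # a sequence of primitive actions


def step(state: State, action: int) -> State:
    """Toy primitive action (illustrative only): swap positions `action` and `action + 1`."""
    cells = list(state)
    cells[action], cells[action + 1] = cells[action + 1], cells[action]
    return tuple(cells)


def apply_macro(state: State, macro: Sequence[int]) -> State:
    for action in macro:
        state = step(state, action)
    return state


def goal_count(state: State, goal: State) -> int:
    """Goal-count heuristic: number of state variables that differ from the goal."""
    return sum(1 for s, g in zip(state, goal) if s != g)


def greedy_search(start: State, goal: State, macros: Sequence[Macro],
                  max_generated: int = 10_000) -> Optional[List[int]]:
    """Greedy best-first search on goal count; returns a primitive-action plan or None."""
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(goal_count(start, goal), next(tie), start, ())]
    seen = {start}
    generated = 0
    while frontier:
        h, _, state, plan = heapq.heappop(frontier)
        if h == 0:
            return list(plan)
        for macro in macros:
            nxt = apply_macro(state, macro)
            generated += 1
            if generated > max_generated:
                return None
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (goal_count(nxt, goal), next(tie), nxt, plan + macro))
    return None


# Primitive actions plus one macro with a small effect: (0, 1, 0) swaps positions 0 and 2.
ACTIONS: List[Macro] = [(0,), (1,), (2,), (0, 1, 0)]

start = (2, 0, 3, 1)
for goal in [(0, 1, 2, 3), (3, 1, 0, 2)]:  # default goal, then a novel goal
    plan = greedy_search(start, goal, ACTIONS)
    print(goal, plan)
```

The same \texttt{ACTIONS} list is passed unchanged for both goals; only the \texttt{goal} argument to the heuristic differs.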

\section{Related Work}

The concept of building macro-actions to improve planning efficiency is not new.
\textcite{dawson1977role} considered two-action macros and analyzed domain structure to remove macros that were invalid or that had no effect.
\poscite{korf1985macro} Macro Problem Solver investigated how to learn macros in problems with decomposable operators and serializable subgoals.
Macro-FF \cite{botea2005macro} and MUM \cite{chrpa2014mum} learned macros from training problem instances and later used them to improve the planning efficiency of testing instances.
% \textcite{newton2007learning} studied using genetic algorithms to maximize macro-action ``fitness" without relying on assumptions about the planner or domain.
The macros we discover in Sections \ref{sec:macro-learning} and \ref{sec:experiments} can similarly be reused across problem instances; however, our macro discovery procedure requires neither goal information nor training instances.
Our method is perhaps most similar to MARVIN \cite{coles2007marvin}, which used macros to escape plateaus during heuristic search, and CAP \cite{asai2015solving}, which decomposed planning problems into subgoals and then found macros to achieve those subgoals.
In all of the prior approaches, the learned macros were found to be beneficial for planning, but they also required an explicit model of the domain.
Our method is more general than these methods, as it is designed to handle the unique challenges of black-box planning without an explicit model.

\textcite{lipovetzky2012width} introduced Iterated Width (IW) search, a ``blind'' planner compatible with black-box simulators, and \textcite{lipovetzky2015classical} subsequently applied it to planning in Atari video game simulators without known goal states.
This work led to the goal-informed Best-First Width Search (BFWS) \cite{lipovetzky2017best,frances2017purely}, which we include in our experimental evaluation.
\textcite{jinnai2017learning} formalized black-box planning and described a method for pruning primitive actions and short macros to avoid generating duplicate states; however, their approach did not incorporate goal information.

% There has also been work on domain-independent approaches to learning planning heuristics, outside the context of learning macros.
% \textcite{virseda2013learning} showed how to automatically learn a weighted combination of existing heuristics to improve planner robustness.
% \textcite{gomoluch2017towards} trained heuristic functions on known solutions for representative problems, and showed that their heuristics generalized to held-out test problems for a small set of domains.
% \textcite{shen2019learning} recently learned domain-independent heuristics from scratch that outperformed existing baseline heuristics.

Recent work by \textcite{agostinelli2019solving} investigated how to train black-box planning heuristics with neural networks and dynamic programming by strategically resetting the simulator to states near the goal state.
Their approach learned heuristics for several domains, including 15-puzzle and Rubik's cube, that supported fast, near-optimal planning.
However, training their neural network requires more than $\num{1000}$ times the simulation budget of our approach, and results in a heuristic that is only informative for a single goal state, whereas ours works for arbitrary goal states.

\begin{table}[t]
\centering
\begin{tabular}{llrr}
\toprule
Domain & Goal & Generated & Solve\\
       & Type & States    & Rate \\
\midrule
15-puzzle & Default   & 4952.4 & 1.0 \\
& Random    & 4780.0 & 1.0 \\
\midrule
Rubik's cube & Default  & 171331.4 & 1.0 \\
& Random    & 152503.7 & 1.0 \\
\bottomrule
\end{tabular}
\caption{Average planning efficiency and solve rate, when reusing previously-discovered focused macros to solve 15-puzzle and Rubik's cube with either the default goal state or new randomly-generated goal states. Novel goal state performance is effectively unchanged.}
\label{tab:alt-goals}
\end{table}

% We also note that \textcite{openai2019rubiks} demonstrated a robotic hand that could learn to manipulate and solve the Rubik's cube, but they employed an off-the-shelf, domain-specific planner to generate solutions.


\section{Discussion and Conclusion}
We have described a method of learning focused macro-actions that enables reliable and efficient black-box planning across a variety of classical planning domains.
While our approach is designed to match the assumptions of the goal-count heuristic, we find that it also improves the performance of more sophisticated black-box planners.
Moreover, our method is even competitive with LAMA, a state-of-the-art planner, despite the latter having access to a declarative description of the problem.

We are encouraged to see that many of the learned macro-actions had intuitive, interpretable meaning in the task domain.
This suggests that our method may be useful for improving explainability in addition to planning efficiency.

This work employed a two-level hierarchy where macro-actions are composed of primitive actions.
One extension that would bring this method more in line with human-expert techniques is to incorporate additional levels of action hierarchy (i.e. macros composed of other macros). Another is to allow macros that permit side effects on certain unsolved variables, combined with macros that subsequently solve those remaining variables.
We leave an exploration of these ideas for future work.