
In the context of ANMs, \citet{Peters2014} introduce the Regression with Subsequent Independence
Test (RESIT) algorithm to learn the structure of the underlying DAG from observational data. In what follows, we denote the underlying true DAG by \(\mathcal{G}_0\). RESIT
has two phases. The \textbf{first phase} infers a causal order among the variables
involved. A permutation \(\pi : [p] \to [p]\) is a valid causal order for a DAG \(\mathcal{G}\)
if \(\pi(g) < \pi(j)\) for every node \(j\in[p]\) and every ancestor \(g\in an_{\mathcal{G}}(j)\). Motivated by
Lemma~\ref{lemma:ancestor_independence}, the following steps are performed \(p-1\) times:
\begin{enumerate}
  \item[(i)] Regress each remaining variable on all other remaining variables, then measure the dependence between the resulting residuals and those other variables.
  \item[(ii)] The variable whose residual is \textit{least dependent} on the other variables is declared a sink node and removed from the set of variables.
\end{enumerate}
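
In symbols, one iteration of the two steps above can be sketched as follows. Here \(S\subseteq[p]\) denotes the set of variables not yet removed, \(\hat f_j\) a fitted regression function, and \(\widehat{\mathrm{dep}}\) a generic empirical dependence measure; all three symbols are notation introduced for this illustration only.
\begin{align*}
  R_j &= X_j - \hat f_j\bigl((X_k)_{k\in S\setminus\{j\}}\bigr), \qquad j\in S,\\
  j^* &\in \operatorname*{arg\,min}_{j\in S}\, \widehat{\mathrm{dep}}\bigl(R_j,\,(X_k)_{k\in S\setminus\{j\}}\bigr), \qquad S \leftarrow S\setminus\{j^*\}.
\end{align*}
The removed node \(j^*\) is placed last among the variables that were still in \(S\), so that the causal order is built from the sinks backwards.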

The \textbf{second phase} selects the \textit{best} DAG among those that agree with the causal order found in the first phase. Several model selection techniques may be used for this purpose. \citet{Peters2014} propose to \textit{greedily test away} edges that are present in the super-DAG associated with the inferred causal order. In contrast, we use sparse model selection techniques to prune extraneous edges.
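
To make the starting point of the pruning step concrete, suppose that the super-DAG is the fully connected DAG induced by the inferred order \(\hat\pi\). This is our reading of the term, and the symbols \(\hat\pi\) and \(E_{\hat\pi}\) are introduced here for illustration. Its edge set is
\[
  E_{\hat\pi} = \bigl\{(g,j)\in[p]\times[p] : \hat\pi(g) < \hat\pi(j)\bigr\},
  \qquad
  |E_{\hat\pi}| = \binom{p}{2} = \frac{p(p-1)}{2}.
\]
Under this reading, the second phase only decides which of these \(p(p-1)/2\) candidate edges to keep; their orientation is already fixed by the first phase.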

In our setting, we can apply the same two-phase procedure to obtain a group DAG estimate of \(\mathcal{G}_0\). However, the estimation and testing procedures involved now operate on random vectors rather than scalar variables, which poses much harder statistical problems. We propose solutions to the estimation tasks in \textsc{Phase I} and \textsc{II} (Algorithms~\ref{alg:gRESIT_phase1} and~\ref{alg:gRESIT_phase2}).

\subsection{Learning a Causal Order}

\begin{algorithm}
  \caption{GroupRESIT algorithm \textsc{Phase I}}
  \SetKwInOut{Input}{Input}
  \SetKwInOut{Output}{Output}
  \SetKwInOut{Initialize}{Initialize}
  \Input{IID samples of \(p\) jointly distributed random vectors \((\mathbf{X}_1, \ldots, \mathbf{X}_p)\).}
  \Output{Learned causal order \(\pi\).}

  \Initialize{\(S \coloneq \{1, \ldots, p\}, \quad  \pi \coloneq [\cdot]\).}
  \Repeat{\(S = \emptyset\)}{
    \For{\(g\in S\)}{
      \begin{itemize}
        \item[(i)] Regress \(\mathbf{X}_g = (X_1^{(g)}, \ldots, X_{d_g}^{(g)})\) \\
          on \(\{\mathbf{X}_j\}_{j\in S \setminus \{g\}}\)
        \item[(ii)] Measure the independence between \\
          the residuals \(\mathbf{R}_g = (R_1^{(g)}, \ldots, R_{d_g}^{(g)})\) \\
          and the remaining groups \(\{\mathbf{X}_j\}_{j\in S \setminus \{g\}}\)
      \end{itemize}
    }
    Identify the group \(g^*\) whose residuals \(\mathbf{R}_{g^*}\) are least dependent on the remaining groups\\
    \(S \coloneq S \setminus \{g^*\}\) \\
    \(\pi \coloneq [g^*, \pi]\)
  }
  \label{alg:gRESIT_phase1}
\end{algorithm}

The first phase (Algorithm~\ref{alg:gRESIT_phase1}) consists of repeatedly solving regression
problems on progressively smaller predictor sets. In each iteration, we train multi-output deep
neural networks \(\phi_\theta\) whose output layers are sized to match the number of response variables
in each group.
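
As an illustration, and assuming a squared-error training loss (the loss is not fixed by the description above), the regression for group \(g\in S\) on \(n\) samples could be written as
\[
  \hat\theta_g \in \operatorname*{arg\,min}_{\theta}\; \frac{1}{n}\sum_{i=1}^{n}
  \Bigl\| \mathbf{x}_{g,i} - \phi_\theta\bigl((\mathbf{x}_{j,i})_{j\in S\setminus\{g\}}\bigr) \Bigr\|_2^2 ,
\]
where \(\mathbf{x}_{g,i}\in\mathbb{R}^{d_g}\) denotes the \(i\)-th observation of group \(g\) and \(\phi_\theta\) maps the stacked predictors, of dimension \(\sum_{j\in S\setminus\{g\}} d_j\), to \(\mathbb{R}^{d_g}\). The sample index \(i\), the sample size \(n\), and the subscript on \(\hat\theta_g\) are notation introduced for this sketch.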

After training is completed, we compute the residuals
\(\mathbf{R}_g = \mathbf{X}_g - \phi_\theta\bigl((\mathbf{X}_j)_{j\in S \setminus \{g\}}\bigr),\)
and identify the group whose residuals exhibit the weakest dependence on the remaining groups. This group is then designated as a \emph{sink node} and removed from the set of variables.

We measure dependence via the Hilbert--Schmidt Independence Criterion (HSIC)~\citep{Gretton2005},
that is, we compute
\(\text{HSIC}((\mathbf{X}_j)_{j\in S \setminus \{g\}}, \mathbf{R}_g)\).
When equipped with characteristic kernels \citep{Fukumizu2007}, the HSIC equals
zero if and only if the random quantities involved are (unconditionally) independent. In particular, the HSIC is applicable to random vectors of general dimension.
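
For concreteness, a common empirical version of this quantity is the biased estimator of \citet{Gretton2005}. The sketch below assumes \(n\) samples, writes \(\mathbf{z}_i\) for the \(i\)-th observation of the stacked predictors \((\mathbf{X}_j)_{j\in S\setminus\{g\}}\) and \(\mathbf{r}_i\) for the corresponding residual of group \(g\), and uses Gaussian kernels with bandwidths \(\sigma_z, \sigma_r > 0\). This notation and the kernel choice are illustrative assumptions.
\begin{align*}
  K_{ik} &= \exp\Bigl(-\frac{\|\mathbf{z}_i - \mathbf{z}_k\|_2^2}{2\sigma_z^2}\Bigr), \qquad
  L_{ik} = \exp\Bigl(-\frac{\|\mathbf{r}_i - \mathbf{r}_k\|_2^2}{2\sigma_r^2}\Bigr), \qquad
  H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top,\\
  \widehat{\text{HSIC}}_g &= \frac{1}{(n-1)^2}\,\operatorname{tr}(KHLH).
\end{align*}
With this estimator, the sink selected in a given iteration would be any \(g^* \in \operatorname*{arg\,min}_{g\in S} \widehat{\text{HSIC}}_g\). Because the kernel matrices are \(n\times n\) regardless of \(d_g\), the same formula applies to groups of any dimension.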

\begin{remark}
  \noindent We emphasize that direct comparisons of HSIC values are meaningful only when all models under consideration share the same dimension and all involved quantities are measured on an identical scale.
\end{remark}
