We have developed a Python library that implements all the methodologies presented in this work, including our simulation study. The library is available at \url{https://github.com/boschresearch/gresit}.

\subsection{Synthetic data generation}\label{sec:synthetic_data}

Synthetic data used throughout the experiments is generated from GANMs (Definition~\ref{def:ANM}) with varying dimension and group size. First, we construct the ground truth causal graph using the Erd\H{o}s--R\'{e}nyi model~\citep{Erdos2011} with the sparsity level set proportional to the number of nodes. To set up the nonlinear link functions \({f}_g\), we follow a strategy similar to previous work on scalar ANMs~\citep{Lachapelle2020, Uemura2022, Rolland2022} and sample \({f}_g\) from randomly weighted sums of Gaussian processes.

Additive noise is generated from multivariate log-normal distributions. First, multivariate Gaussians are generated whose mean vector entries and off-diagonal covariance entries are sampled uniformly from the interval \([-0.8,0.8]\). Then the exponential is taken component-wise. While propagating through the graph, we rescale the variables in each equation to a signal-to-noise ratio of two. To reduce the risk of unintended design patterns that might be picked up by the algorithms we employ~\citep{Reisach2021}, data is always standardized.
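The noise construction described above can be sketched as follows. This is a minimal illustration rather than the implementation used in our library: the function names \texttt{sample\_lognormal\_noise} and \texttt{standardize} are hypothetical, and the choice of the covariance diagonal (strict diagonal dominance, to guarantee positive definiteness) is an assumption that the description above leaves open.

```python
import numpy as np


def sample_lognormal_noise(n, d, rng, low=-0.8, high=0.8):
    """Draw n samples of d-dimensional multivariate log-normal noise."""
    mean = rng.uniform(low, high, size=d)
    # Symmetric matrix whose off-diagonal entries are uniform on [low, high].
    upper = np.triu(rng.uniform(low, high, size=(d, d)), k=1)
    cov = upper + upper.T
    # ASSUMPTION: the diagonal is set so that the matrix is strictly
    # diagonally dominant, which guarantees positive definiteness.
    np.fill_diagonal(cov, np.abs(cov).sum(axis=1) + 1.0)
    gaussian = rng.multivariate_normal(mean, cov, size=n)
    # Component-wise exponential turns Gaussian draws into log-normal ones.
    return np.exp(gaussian)


def standardize(x):
    """Center every column and scale it to unit standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


rng = np.random.default_rng(0)
noise = sample_lognormal_noise(n=200, d=3, rng=rng)
noise_std = standardize(noise)
```

The signal-to-noise rescaling is omitted here because it depends on the link functions of the respective structural equation.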

\subsection{Metrics}\label{sec:metrics}

Evaluating causal graphs is difficult, as the mere number of differing edges between the learned and the ground truth graph does not necessarily reflect how the graphs differ in their capability to answer causal queries. This difficulty is exacerbated by the fact that not all causal discovery algorithms return the same graphical object: different causal assumptions made by the corresponding routines lead to different types of causal graphs. Consequently, we report several evaluation metrics in order to capture strengths and weaknesses of the routines employed across a wide spectrum of tasks. In addition to metrics that quantify edge recovery, we report the recently proposed graph distances of~\citet{Henckel2024}, which count the number of wrongly inferred causal effects, determined by different identification strategies, when using the learned rather than the true DAG.

\textit{Precision, Recall and \(F_1\) score}

We start with binary classification metrics that are prominent in the machine learning literature. In a graphical context, \textit{Precision} is the fraction of correctly determined edges among all identified edges. \textit{Recall} is the fraction of correctly identified edges among all edges present in the ground truth. The \(F_1\) score is the harmonic mean of Precision and Recall, i.e., \(F_1 = 2 \cdot\text{Precision}\cdot\text{Recall} /(\text{Precision}+\text{Recall})\).
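As a small illustration of these definitions, the sketch below computes the three quantities from binary adjacency matrices. It assumes that entry \((i,j)\) equals one if and only if the edge \(i \to j\) is present, that edges are compared including their orientation, and that a ratio with zero denominator is reported as zero. The function name \texttt{edge\_metrics} is hypothetical.

```python
import numpy as np


def edge_metrics(est, truth):
    """Precision, recall and F1 for directed edges of two adjacency matrices."""
    est = np.asarray(est, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    true_pos = np.sum(est & truth)  # edges present in both graphs
    n_est, n_true = est.sum(), truth.sum()
    # ASSUMPTION: ratios with a zero denominator are reported as 0.
    precision = true_pos / n_est if n_est else 0.0
    recall = true_pos / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Ground truth: 0 -> 1 -> 2. Estimate: 0 -> 1 and 0 -> 2.
truth = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
est = np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])
precision, recall, f1 = edge_metrics(est, truth)
```

In this example one of the two estimated edges is correct and one of the two true edges is recovered, so all three quantities equal \(0.5\).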

\textit{Structural Hamming Distance (SHD)}

The SHD counts the number of differing edges between two graphs.
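Conventions for the SHD differ in how a reversed edge is counted. The following sketch assumes that every pair of nodes whose edge status differs (missing, extra, or reversed) contributes exactly one to the distance. Both this convention and the function name \texttt{shd} are assumptions made for illustration.

```python
import numpy as np


def shd(est, truth):
    """Structural Hamming distance between two directed graphs."""
    est = np.asarray(est, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    diff = est != truth
    # A node pair {i, j} differs if either direction differs.
    # ASSUMPTION: a reversed edge therefore counts once, not twice.
    pair_diff = diff | diff.T
    return int(np.triu(pair_diff, k=1).sum())


# Ground truth: 0 -> 1 -> 2. Estimate: 1 -> 0, 1 -> 2 and 0 -> 2.
truth = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
est = np.array([[0, 0, 1], [1, 0, 1], [0, 0, 0]])
distance = shd(est, truth)  # one reversed edge plus one extra edge
```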

\textit{Structural Interventional Distance (SID)~\citep{Peters2015}}

The SID counts the number of incorrectly inferred interventional distributions from the learned graph when compared to the ground truth. For DAGs, this equates to counting parent sets in the learned DAG that are not valid adjustment sets in the ground truth DAG.

\textit{Ancestor Adjustment Identification Distance (AAID)~\citep{Henckel2024}}

While parent adjustment is a valid adjustment strategy, there exist statistically more efficient adjustment sets. Based on this observation, \citet{Henckel2024} generalize the idea of the SID and present identification distances that count the number of incorrect identification formulas that arise when using a given identification strategy. The SID arises when parent adjustment is chosen as the identification strategy. However, the SID may produce surprisingly large values when comparing two DAGs that have the same causal order. Choosing ancestor rather than parent adjustment leads to a distance that is zero whenever the estimated and the ground truth DAG agree in terms of their causal orders.

\textit{Order Adjustment Identification Distance (OAID)~\citep{Henckel2024}}

In order to emphasize the role played by the pruning step, we also provide a DAG-to-order distance. The \textit{OAID} arises when comparing the super-DAG \(\mathcal{G}^\pi\) and the ground truth \(\mathcal{G}_0\) in terms of their \textit{AAID}. Consequently, the \textit{OAID} counts the number of incorrect identification formulas derived from the estimated causal order \(\pi\).
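To make the object \(\mathcal{G}^\pi\) concrete, the sketch below builds a super-DAG from a causal order. It reads the super-DAG as the fully connected DAG that contains an edge from every node to all nodes appearing later in \(\pi\), which is an assumption of this illustration. The helper names \texttt{super\_dag} and \texttt{respects\_order} are hypothetical, and the computation of the AAID itself is not shown.

```python
import numpy as np


def super_dag(order):
    """Fully connected DAG whose edges point from earlier to later nodes."""
    order = np.asarray(order, dtype=int)
    p = len(order)
    adj = np.zeros((p, p), dtype=int)
    for pos, node in enumerate(order):
        # Add an edge from `node` to every node that comes later in the order.
        adj[node, order[pos + 1:]] = 1
    return adj


def respects_order(dag, order):
    """Check that every edge i -> j of `dag` has i before j in `order`."""
    position = np.empty(len(order), dtype=int)
    position[np.asarray(order, dtype=int)] = np.arange(len(order))
    sources, targets = np.nonzero(dag)
    return bool(np.all(position[sources] < position[targets]))


order = [2, 0, 1]
g_pi = super_dag(order)  # edges 2 -> 0, 2 -> 1 and 0 -> 1
truth = np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]])  # single edge 2 -> 1
```

Any DAG that respects the order, such as \texttt{truth} above, is a subgraph of the corresponding super-DAG.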

\subsection{Algorithms}\label{sec:algorithms}
\begin{table*}[ht]
  \centering
  \caption{Hyperparameter and tuning choices for all methods employed in this work.}
  \label{table:tuning}
  \begin{tabularx}{\textwidth}{@{}lX@{}}
    \toprule
    Method & Parameters \\
    \midrule
    \textit{GroupRESIT} &
    \texttt{regression}: MLP with tanh activation;
    \texttt{n\_epochs}=500;
    \texttt{lr}=0.01;
    \texttt{loss}=MSE;
    \texttt{batch\_size}=500;
    \texttt{indep\_test}=HSIC;
    \texttt{alpha}=0.01
    \\[0.8em]

    \textit{MURGS} &
    \texttt{smoother}=Gaussian kernel regression;
    \texttt{plugin bandwidth}=$0.6\,\mathrm{sd}(X)\,n^{-1/5}$
    \\[0.8em]

    \textit{GroupPC} &
    \texttt{indep\_test}=Fisher’s Z;
    \texttt{alpha}=0.05
    \\[0.8em]

    \textit{GroupGraN-DAG} &
    \texttt{hidden\_num}=2;
    \texttt{hidden\_dim}=10;
    \texttt{batch\_size}=64;
    \texttt{lr}=0.001;
    \texttt{iterations}=100\,000;
    \texttt{model\_name}=NonLinGaussANM;
    \texttt{nonlinear}=leaky-ReLU;
    \texttt{optimizer}=RMSProp;
    \texttt{h\_threshold}=1e-8;
    \texttt{lambda\_init}=0.0;
    \texttt{mu\_init}=0.001;
    \texttt{omega\_lambda}=1e-4;
    \texttt{omega\_mu}=0.9;
    \texttt{stop\_crit\_win}=100;
    \texttt{edge\_clamp\_range}=1e-4
    \\[0.8em]

    \textit{GroupLiNGAM} &
    \texttt{regression}=OLS;
    \texttt{indep\_test}=HSIC
    \\
    \bottomrule
  \end{tabularx}
\end{table*}

We employ the following causal discovery algorithms in our experiments.

\textit{GroupRESIT}

Owing to the modular design of GroupRESIT, one may, in principle, combine various pairs of multi-response regression methods and vector independence tests in the first phase. To ensure broad applicability, we employ neural networks for the multi-response regression, specifically multilayer perceptrons (MLPs) with hyperbolic tangent activation functions and early stopping. We use the mean squared error (MSE) loss, which yielded better performance than the HSIC loss \citep{Mooij2009,Greenfeld2020}. For the subsequent independence test, we apply the empirical HSIC~\citep{Gretton2005} with Gaussian RBF kernels, using the median heuristic for bandwidth selection.
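The independence statistic can be sketched as follows. The code computes the biased empirical HSIC, \(\operatorname{tr}(KHLH)/(n-1)^2\), between two samples of random vectors. Taking the bandwidth as the median of the pairwise Euclidean distances is one common variant of the median heuristic and an assumption here. The function names are hypothetical, and the computation of \(p\)-values is not shown.

```python
import numpy as np


def rbf_gram(x):
    """Gaussian RBF Gram matrix with a median heuristic bandwidth."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    # ASSUMPTION: bandwidth = median pairwise Euclidean distance
    # between distinct sample points.
    off_diag = sq_dists[np.triu_indices_from(sq_dists, k=1)]
    sigma = np.sqrt(np.median(off_diag))
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def hsic(x, y):
    """Biased empirical HSIC statistic tr(K H L H) / (n - 1)^2."""
    n = len(x)
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    gram_x, gram_y = rbf_gram(x), rbf_gram(y)
    return np.trace(gram_x @ centering @ gram_y @ centering) / (n - 1) ** 2


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_dependent = np.tanh(x) + 0.1 * rng.normal(size=(200, 2))
y_independent = rng.normal(size=(200, 2))
stat_dependent = hsic(x, y_dependent)
stat_independent = hsic(x, y_independent)
```

In the setting described above, \texttt{x} would play the role of the candidate parent groups and \texttt{y} that of the regression residuals.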

In the second phase, we compare the performance of MURGS with that of the greedy independence criterion employed in the original RESIT framework \citep{Peters2014}. Any linear smoother can be used to estimate the conditional expectations during the backfitting procedure; in our experiments, we employ Gaussian kernel regression with a plug-in bandwidth \(h = 0.6\cdot\text{sd}(X)n^{-1/5}\). For greedy independence testing, we again apply the empirical HSIC, with \(p\)-values computed via the gamma approximation on a separate test dataset.
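The smoother used in the backfitting step can be illustrated for a single scalar covariate. The sketch assumes a Nadaraya--Watson form of Gaussian kernel regression and takes \(\text{sd}(X)\) to be the sample standard deviation. Both choices, as well as the function names, are assumptions of this illustration.

```python
import numpy as np


def plugin_bandwidth(x):
    """Plug-in bandwidth h = 0.6 * sd(X) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    # ASSUMPTION: sd(X) is the sample standard deviation (ddof=1).
    return 0.6 * np.std(x, ddof=1) * len(x) ** (-1 / 5)


def gaussian_smoother_matrix(x, h=None):
    """Row-normalized Gaussian kernel weights; fitted values are S @ y."""
    x = np.asarray(x, dtype=float)
    h = plugin_bandwidth(x) if h is None else h
    weights = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)
smoother = gaussian_smoother_matrix(x)
fitted = smoother @ y  # linear in y, as required for a linear smoother
```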

\textit{GroupGraN-DAG}

GraN-DAG~\citep{Lachapelle2020} is a score-based algorithm developed to handle nonlinear relations among variables. It utilizes the continuous acyclicity constraint first suggested by~\citet{Zheng2018}. With the appropriate loss function, GraN-DAG can be tailored towards Gaussian nonlinear additive noise models. To ensure that GraN-DAG operates on the group level, we adapt the micro-level acyclicity constraint to encourage acyclicity of the corresponding group DAG.

Recall that the variable groups \(\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_p)\) satisfy \(\mathbf{X}_g \in \mathbb{R}^{d_g}\) for each group \(g\). If we treat all group entries as scalar random variables, we obtain \(m = \sum_{g=1}^p d_g\) micro variables. GraN-DAG enforces acyclicity via the trace exponential of a weighted adjacency matrix \(A_\phi \in \mathbb{R}_{\geq 0}^{m \times m}\) that arises from the weights in the neural network. More specifically, the micro-acyclicity constraint amounts to \(h(A_\phi) = \operatorname{tr}(e^{A_\phi \circ A_\phi}) - m = 0\), where \(\circ\) denotes the Hadamard product. Similar to \citet{Kikuchi2023}, we enforce acyclicity in the corresponding group DAG via the weighted group adjacency matrix \(A_\phi^{\text{group}} \in \mathbb{R}_{\geq 0}^{p \times p}\), where
\begin{equation}
  (A_\phi^{\text{group}})_{gh} =
  \begin{cases}
    0 & \text{if } g=h,\\
    \frac{1}{d_g d_h}
    \sum_{i\in [d_g]}\sum_{j\in [d_h]} (A_\phi)_{i,j} & \text{otherwise}.
  \end{cases}
\end{equation}
By setting the diagonal to zero in \(A_\phi^{\text{group}}\), we ignore the graph structure induced
by \(A_\phi\) within the groups and only focus on the inter-group relations.
As we want to enforce an acyclic group graph, we use the following modified constraint in the
augmented Lagrangian method of~\citet{Lachapelle2020}:
\begin{equation}
  h(A_\phi^{\text{group}}) = \operatorname{tr}(e^{A_\phi^{\text{group}} \circ A_\phi^{\text{group}}}) - p = 0.
\end{equation}
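The two equations above can be illustrated numerically. The sketch assumes that the micro variables are ordered group by group, so that each pair of groups corresponds to a contiguous block of \(A_\phi\). It evaluates the trace of the matrix exponential with a truncated power series in order to stay within NumPy. The function names are hypothetical and the code is not the GraN-DAG implementation.

```python
import numpy as np


def group_adjacency(a_micro, group_sizes):
    """Average the micro-level weights over each off-diagonal group block."""
    a_micro = np.asarray(a_micro, dtype=float)
    # ASSUMPTION: micro variables are ordered group by group.
    bounds = np.concatenate(([0], np.cumsum(group_sizes)))
    p = len(group_sizes)
    a_group = np.zeros((p, p))
    for g in range(p):
        for h in range(p):
            if g != h:  # the diagonal stays zero
                block = a_micro[bounds[g]:bounds[g + 1], bounds[h]:bounds[h + 1]]
                a_group[g, h] = block.mean()  # 1 / (d_g d_h) * sum of entries
    return a_group


def trace_expm(matrix, n_terms=30):
    """tr(exp(matrix)) via a power series truncated after n_terms terms."""
    term = np.eye(matrix.shape[0])
    total = 0.0
    for k in range(n_terms):
        total += np.trace(term)
        term = term @ matrix / (k + 1)
    return total


def acyclicity_violation(a):
    """h(A) = tr(exp(A o A)) - dim(A); zero if A encodes an acyclic graph."""
    return trace_expm(a * a) - a.shape[0]


group_sizes = [2, 1, 2]
a_micro = np.zeros((5, 5))
a_micro[0, 1], a_micro[1, 0] = 0.7, 0.5  # cycle inside group 0, ignored
a_micro[0, 2], a_micro[1, 2] = 0.4, 0.2  # group 0 -> group 1
a_micro[2, 3] = 1.0                      # group 1 -> group 2
a_group = group_adjacency(a_micro, group_sizes)
h_acyclic = acyclicity_violation(a_group)

a_micro[4, 0] = 0.8                      # group 2 -> group 0 closes a cycle
h_cyclic = acyclicity_violation(group_adjacency(a_micro, group_sizes))
```

The within-group cycle leaves the group constraint at zero, whereas the additional edge from the third to the first group makes it strictly positive.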
In general, the final weighted adjacency matrix in~\textit{GroupGraN-DAG} is not sparse. Therefore, appropriate thresholds need to be set to enforce strict zeros. Unfortunately, the thresholded adjacency matrix does not necessarily encode a DAG. In such cases, we continue to select thresholds until the resulting weighted adjacency matrix becomes acyclic.
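One possible way to organize such a threshold search is sketched below. The text does not specify how successive thresholds are selected; the code assumes that candidate thresholds are the observed weights in ascending order, starting from an initial value \texttt{start}. The function names \texttt{is\_acyclic} and \texttt{threshold\_to\_dag} are hypothetical.

```python
import numpy as np


def is_acyclic(adj):
    """Check acyclicity by repeatedly removing nodes without incoming edges."""
    adj = np.asarray(adj) != 0
    remaining = list(range(adj.shape[0]))
    while remaining:
        sub = adj[np.ix_(remaining, remaining)]
        sources = [node for k, node in enumerate(remaining) if not sub[:, k].any()]
        if not sources:
            return False  # every remaining node has a parent, so a cycle exists
        remaining = [node for node in remaining if node not in sources]
    return True


def threshold_to_dag(weights, start=0.0):
    """Raise the threshold until the pruned weight matrix encodes a DAG."""
    weights = np.asarray(weights, dtype=float)
    # ASSUMPTION: candidates are `start` followed by the larger observed weights.
    candidates = np.concatenate(([start], np.unique(weights[weights > start])))
    for threshold in candidates:
        pruned = np.where(weights > threshold, weights, 0.0)
        if is_acyclic(pruned):
            return pruned, float(threshold)
    # Not reached for finite input: thresholding at the largest weight
    # removes all edges, and the empty graph is acyclic.
    raise RuntimeError("no acyclic thresholding found")


weights = np.array([[0.0, 0.9, 0.0],
                    [0.1, 0.0, 0.8],
                    [0.0, 0.05, 0.0]])
pruned, threshold = threshold_to_dag(weights)
```

In this example the weights \(0.05\) and \(0.1\) are removed, which leaves the chain \(0 \to 1 \to 2\).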

\textit{GroupPC}

The PC algorithm~\citep{Spirtes1993} performs conditional independence tests in a resource-efficient way in order to remove edges from a fully connected undirected graph. After this first phase, the algorithm orients as many edges as possible. We implement the stable version of the algorithm proposed by~\citet{Colombo2014}. To adapt the PC algorithm to the group setting, the tests involved need to handle conditional independence between two random vectors given a set of random vectors. While~\citet{Zhang2009} extended the HSIC to conditional independence testing, its prohibitively long runtime prevents us from using it in our experiments. Instead, we use the simple Fisher-Z test and treat group entries individually. We then aggregate the coordinate-wise hypotheses based on scalar variables using the union-intersection method. More precisely, in the skeleton-finding phase, we remove an edge if the \(p\)-values of all individual tests are larger than the significance level \(\alpha\). Otherwise, the edge is retained. The significance level \(\alpha\) acts as a hyperparameter: the smaller \(\alpha\), the higher the hurdle for keeping an edge in the first phase, so that sparser graphs are returned.

In general, the algorithm returns a completed partially directed acyclic graph (CPDAG). While in principle the graph metrics developed by \citet{Henckel2024} return meaningful results for CPDAG-to-DAG comparisons, the same cannot be said for the remaining ones. In particular, comparing the PC algorithm with algorithms that return DAGs becomes difficult. Thus, we compute the SHD for each DAG consistent with the CPDAG and select the one with the smallest SHD.
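The aggregation of coordinate-wise tests can be sketched as follows. The code interprets the union-intersection rule as removing an edge only if none of the scalar Fisher-Z tests rejects, i.e. if the smallest \(p\)-value exceeds \(\alpha\). It further assumes that every scalar test conditions on all coordinates of the conditioning groups and that no multiplicity correction is applied. These choices and the function names are assumptions of this illustration.

```python
import math

import numpy as np


def fisher_z_pvalue(x, y, z=None):
    """Two-sided Fisher-Z p-value for the partial correlation of x and y given z."""
    cols = [x, y] if z is None else [x, y, z]
    data = np.column_stack(cols)
    n, k = data.shape[0], data.shape[1] - 2  # k = size of the conditioning set
    # The partial correlation is read off the inverse correlation matrix.
    prec = np.linalg.inv(np.corrcoef(data, rowvar=False))
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep atanh finite
    stat = math.sqrt(n - k - 3) * abs(math.atanh(r))
    return math.erfc(stat / math.sqrt(2))  # equals 2 * (1 - Phi(stat))


def group_edge_removed(x_group, y_group, z_groups=None, alpha=0.05):
    """Remove the edge only if no coordinate-wise test rejects independence."""
    # ASSUMPTION: z_groups stacks all conditioning coordinates column-wise.
    pvals = [fisher_z_pvalue(x_group[:, i], y_group[:, j], z_groups)
             for i in range(x_group.shape[1])
             for j in range(y_group.shape[1])]
    return min(pvals) > alpha, pvals


rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = x @ np.array([[1.0, 0.5], [0.5, 1.0]]) + 0.1 * rng.normal(size=(500, 2))
z = rng.normal(size=(500, 1))
removed_marginal, pvals_marginal = group_edge_removed(x, y)
removed_given_z, pvals_given_z = group_edge_removed(x, y, z)
```

Since \texttt{y} depends strongly on \texttt{x} in this example, the edge is retained both marginally and given the unrelated group \texttt{z}.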

\textit{GroupLiNGAM}

We implement the \textit{GroupDirectLiNGAM} algorithm of~\citet{Entner2012}, which extends the direct estimation method for LiNGAM introduced by~\citet{Shimizu2006,Shimizu2011} to handle vector-valued variables. Since~\citet{Entner2012} focus solely on the causal ordering step, we use MURGS to recover the full graph structure thereafter.

Table~\ref{table:tuning} reports all hyperparameters and tuning parameter choices for benchmark and real data results.

\begin{figure*}[ht!]
  \centering
  \includegraphics[width=.82\textwidth]{Figures/murgs_sim.png}
  \caption{Feature selection capability of MURGS. Node size is \(p=10,20\), with \(2,4\) active groups, respectively.}
  \label{fig:murgs_sim}
\end{figure*}
