
\paragraph{Setup}
\begin{figure}
  \includegraphics[width=.48\textwidth]{Figures/benchmark_results_scaling.pdf}
  \caption{Performance of the algorithms under varying dimensions of the GANM.}
  \label{fig:benchmark_scaling}
\end{figure}

In this section we assess and compare the performance of GroupRESIT with MURGS (\textit{GRESIT-MURGS}), GroupRESIT with greedy independence testing (\textit{GRESIT-IND}), a grouped version of the PC algorithm~\citep{Spirtes1993} (\textit{GPC}), and a grouped version of GraN-DAG~\citep{Lachapelle2020} (\textit{GGraN-DAG}). Furthermore, we report a baseline algorithm that picks a causal order at random and subsequently applies MURGS (\textit{GRandReg}). Implementation details, including our modifications to the PC algorithm and GraN-DAG to accommodate the group setting, along with a description of the metrics, synthetic data, and a table of hyperparameters, are provided in Section~\ref{sec:simulation_details}.

\begin{figure*}
  \centering
  \includegraphics[width=.82\textwidth]{Figures/result_boxplot_10_2_2000.pdf}\\
  \includegraphics[width=.82\textwidth]{Figures/result_boxplot_15_5_2000.pdf}
  \caption{Simulation results based on \(20\) repetitions.}
  \label{fig:sim_results}
\end{figure*}

\paragraph{Results}

Results from our experiments are shown in Figures~\ref{fig:benchmark_scaling} and~\ref{fig:sim_results}. All metrics are averaged over \(20\) independent simulation runs. Synthetic data are generated from GANMs whose nonlinear functions are weighted sums of Gaussian processes.
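As a rough illustration of this data-generating process (the notation below is assumed for this sketch and may differ from the definitions in Section~\ref{sec:simulation_details}), each group \(\mathbf{X}_j\) can be written as
\[
  \mathbf{X}_j = f_j\bigl(\mathbf{X}_{\mathrm{PA}(j)}\bigr) + \mathbf{N}_j,
  \qquad
  f_j = \sum_{k=1}^{K} w_{jk}\, g_{jk},
  \qquad
  g_{jk} \sim \mathcal{GP}(0, \kappa_{jk}),
\]
where \(\mathrm{PA}(j)\) denotes the parent groups of group \(j\) and \(\mathbf{N}_j\) is a noise vector independent of \(\mathbf{X}_{\mathrm{PA}(j)}\). The number of components \(K\), the weights \(w_{jk}\), and the kernels \(\kappa_{jk}\) are hypothetical placeholders for this sketch, not the values used in our experiments.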

Focusing on AAID, Figure~\ref{fig:benchmark_scaling} illustrates the algorithms' performance across a range of graph sizes (number of nodes \(p\)), sample sizes \(n\), and group sizes \(d_j\). Across all settings, \textit{GRESIT-MURGS} consistently outperforms \textit{GGraN-DAG}, although it too struggles on very large graphs with limited sample sizes. The performance difference between \textit{GRESIT-MURGS} and \textit{GGraN-DAG} is particularly pronounced when the group size is varied. Indeed, \textit{GRESIT-MURGS} retains its good performance across different group sizes.

In Figure~\ref{fig:sim_results}, we present a comparison across all considered metrics for two fixed choices of \(p\), \(d_g\), and \(n\). The first row displays classification metrics, where higher values indicate better performance, while the second row shows graph distances, where lower values indicate more accurate graph recovery. In every case, \textit{GRESIT-MURGS} outperforms the other methods. Notably, \textit{GPC} deteriorates as the group size increases, and \textit{GGraN-DAG} suffers similarly, albeit to a lesser extent. Regarding graph distances, the AAID metric is particularly informative, as it reflects the quality of the estimated causal order; here, both GroupRESIT procedures exhibit a clear advantage. In contrast, the pruning phase in \textit{GRESIT-IND} proves ineffective, as evidenced by its high SHD.

Overall, the combination of flexible neural networks and nonparametric independence tests proves highly effective in estimating a causal order, provided that the sample size is sufficiently large. Moreover, MURGS demonstrates a robust capability for feature selection even in the presence of large group sizes and numerous candidate parents.
