\section{Experiments}\label{sec:experiments}

In this section, we evaluate \name by addressing two key questions: 1) \textit{Does a causal graph help \name identify the root cause?} 2) \textit{How quickly can \name find the root cause?} We then discuss our implementation setup and present the results. We provide additional results in Appendix~\ref{app:additional-experiments}.

\begin{figure*}[!t]
    \centering
    \subfigure[Top-$l$ accuracy of \name compared to baselines.]{%
        \includegraphics[width=0.65\textwidth]{figures/top-l-accuracy.pdf}
        \label{fig:top-l-accuracy}
    }
    \subfigure[Top-1 accuracy with varying samples.]{%
        \includegraphics[width=0.32\textwidth]{figures/int_samples_recall.pdf}
        \label{fig:int-samples}
    }
    \caption{The results demonstrate that \name with \name-2 consistently achieves higher accuracy than RCD. MI struggles because it cannot condition on the parents of each node, whereas RCD can condition on other nodes but lacks information about the causal structure. In contrast, \name overcomes these challenges by learning a causal graph and using CMI to rank the nodes effectively.}
    \label{fig:result}
\end{figure*}

\textbf{Implementation.} To generate experimental data, we followed the streamlined approach of prior work~\citep{ikram2022root, lin2024root}, using \texttt{pyagrum}~\citep{hal-03135721} to create random causal graphs. We then generated samples for both observational and interventional settings by perturbing the data-generating process of a randomly selected node. Each experiment was repeated 100 times, and we report the mean and standard error of the results. In the RCA literature, a key effectiveness metric is top-$l$ accuracy, defined as the probability that the root cause appears among the top $l$ ranked candidates. Hence, we report top-$l$ accuracy along with the execution runtime.
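
As a sketch of how these quantities can be computed, assume each of the $R$ repetitions contributes a single hit-or-miss outcome, and let $r_j$ denote the rank assigned to the true root cause in repetition $j$ (the symbols $R$, $r_j$, and $\hat{\sigma}_l$ are illustrative notation introduced here, with $R = 100$ in our setup):
\[
    \mathrm{Acc}@l \;=\; \frac{1}{R} \sum_{j=1}^{R} \mathbf{1}\left[ r_j \le l \right],
    \qquad
    \mathrm{SE}_l \;=\; \frac{\hat{\sigma}_l}{\sqrt{R}},
\]
where $\hat{\sigma}_l$ is the sample standard deviation of the per-repetition indicators $\mathbf{1}[r_j \le l]$.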

We implemented the following baselines:
\begin{itemize}
    \item \textbf{RUN}~\citep{lin2024root}: Constructs a causal graph using neural Granger causal discovery with contrastive learning, then ranks the nodes by PageRank with a personalization vector derived from the learned graph.
    \item \textbf{BARO}~\citep{pham2024baro}: A non-causal approach that ranks root causes by computing a score for each variable, based on the absolute difference between each post-failure sample and the median of pre-failure data.
    \item \textbf{SMOOTH}~\citep{okati2024root}: A recent method that identifies the root cause given a complete causal graph.
    \item \textbf{MI}: A simple approach that ranks nodes by their mutual information with \fnode.
    % \item \textbf{cRCA}~\citep{xin2023causalrca}: It uses DAG-GNN \citep{yu2019dag} to learn a causal graph and then rank the root causes by applying PageRank to the learned graph. 
    \item \textbf{RCD}~\citep{ikram2022root}: A recent method that uses CI tests to identify the root cause.
    % \item \textbf{DAG}: A prototype of Algorithm~\ref{alg:sample-version} that uses the true DAG as the causal graph.
    \item \textbf{\name}: A prototype of Algorithm~\ref{alg:sample-version} that uses a CPDAG.
\end{itemize}
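
The difference between the MI baseline and a CMI-based ranking can be written schematically as follows. The notation is illustrative and introduced only here: $X_i$ denotes node $i$, $F$ stands for \fnode, and $\mathrm{Pa}(X_i)$ denotes the parents of $X_i$ in the causal graph. The exact conditioning sets and estimator used by \name are those of Algorithm~\ref{alg:sample-version}; with a CPDAG, the parent set may be only partially determined.
\[
    s^{\mathrm{MI}}_i \;=\; I\left(X_i ; F\right),
    \qquad
    s^{\mathrm{CMI}}_i \;=\; I\left(X_i ; F \,\middle|\, \mathrm{Pa}(X_i)\right).
\]
In both cases, nodes are sorted in decreasing order of score. The MI score cannot separate a node's own shift from a shift inherited from its parents, which conditioning on $\mathrm{Pa}(X_i)$ is meant to address.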

To demonstrate the value of graphical structure, we first present an experiment in which all baselines use the ground-truth graph as input. Results with graphs learned from data are shown in Appendix~\ref{app:sample-version-result}. We also compare two variants of~\name: M-IGS (Modified IGS)\footnote{For IGS, we referred to the recent findings of the POMS paper~\citep{shangqi2023partial}, but the authors declined to share their code in a form that can be made public. Consequently, we implemented an older, simpler version~\citep{tao2019interactive}. For a runtime comparison, see Theorems~\ref{app:shang_theorem} and~\ref{app:tao_theorem} in the Appendix.}, which takes a DAG as input and identifies the root cause per Lemma~\ref{lem:reduction}; and RCG(CPDAG), which uses the essential graph generated by the PC algorithm~\citep{spirtes2000causation}. We used 10,000 samples for the normal period and only 1,000 samples for the post-failure dataset.
% The goal of this experiment is to demonstrate the value of a causal graph to effectively diagnose a failure.

Figure~\ref{fig:top-l-accuracy} shows the top-$l$ accuracy of the different approaches for $l \in \{1, 3, 5\}$. Although M-IGS has the lowest runtime among all CI-based methods, its accuracy drops sharply. This decline stems from a key limitation: \textit{M-IGS assumes perfect CI tests}, but in practice, test results can be unreliable when few samples are available. As a result, M-IGS often makes incorrect decisions, leading to poor performance, especially as the number of nodes increases. This weakness is particularly evident when comparing M-IGS with SMOOTH. Both methods operate on a fully known causal graph (DAG), but SMOOTH, being score-based, ranks variables using the anomaly scores of their parents and is therefore more robust to noisy CI tests. Consequently, it outperforms M-IGS under imperfect conditions. This observation underscores a critical point: while IGS has strong theoretical guarantees, its practical performance suffers in the presence of noisy CI tests, where even a single error can propagate and significantly affect the result.
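
A back-of-the-envelope calculation illustrates this sensitivity. Assume, purely for illustration, that the search issues $q$ CI tests, that each test errs independently with probability $\epsilon$, and that any single error derails the search. The values of $q$ and $\epsilon$ and the independence assumption are hypothetical and are not measured in our experiments. Then
\[
    \Pr\left[\mathrm{no\ CI\ test\ errs}\right] \;=\; (1 - \epsilon)^{q},
\]
so that, for example, $\epsilon = 0.05$ and $q = 10$ give $0.95^{10} \approx 0.60$. Under these assumptions, even a modest per-test error rate leaves a substantial chance that at least one decision along the search path is wrong.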

Similarly, RUN performs poorly due to its PageRank personalization algorithm, which incorporates arbitrary constraints not applicable to our experimental setup, such as assuming that leaf nodes are more likely to be the root cause. As a result, even with the ground truth DAG, RUN fails to identify the root cause.
BARO, which relies solely on the absolute difference between the normal and failure periods, also performs poorly because failures can propagate across multiple nodes in a graph, often affecting child nodes more strongly than the root cause itself.
When comparing RCD and \name, we find that \name generally achieves better accuracy. With 100 nodes, RCG(CPDAG) ranks the root cause in the top-1 position with an accuracy of 87\%, surpassing RCD's 78\%. RCG(CPDAG)'s accuracy further improves to 93\% when the top-5 nodes are considered.
Similarly, RCG(CPDAG) consistently outperforms SMOOTH. For instance, at 100 nodes, RCG(CPDAG) achieves a top-1 accuracy of 87\%, compared to 78\% for SMOOTH. A key distinction between the two methods is that RCG(CPDAG) operates on a partially known causal graph, whereas SMOOTH requires a fully specified DAG as input.

% Nonetheless, the runtime can be reduced if an accurate sparse graph is learned during normal operations. This is evident from the runtime of~\name(CPDAG), where the input essential graph is learned from $k = n - 2$ (with $n$ as the number of nodes). Thus, \name offers a trade-off between the number of observational samples and the runtime for identifying the root cause post-failure. More observational samples result in a sparser graph, which increases runtime before the failure but ultimately reduces runtime \emph{after} the failure.

Since root cause analysis is often time-sensitive, we report the number of samples each approach requires to perform effectively. We focus on sample efficiency rather than execution time, as most graph-based methods can be parallelized in large clusters; the key question is therefore how many samples must be collected before an approach becomes effective. Nevertheless, we provide runtime comparisons in Appendix~\ref{app:int-samples-runtime}. Figure~\ref{fig:int-samples} shows the top-1 accuracy of three competing approaches on 25-node graphs with a varying number of interventional samples. RCG(CPDAG) consistently outperforms the baselines due to its reliance on the CPDAG and its ability to orient edges after failure. Appendix~\ref{app:more-int-samples-accuracy} extends these results to 50- and 100-node graphs.
