\section{Introduction}
Root Cause Analysis (RCA), which aims to identify the root cause of failures, is crucial for ensuring the reliability and stability of production systems in diverse domains, including medicine~\citep{kellogg2016our, latino2015effectiveness}, telecommunications~\citep{schaaf2015towards}, and IT operations~\citep{whitney2013root, drasar2019operations}. In cloud applications, particularly those using microservice architectures, the challenges of RCA are even more pronounced. The large number of microservices complicates pinpointing the primary cause of failures~\citep{netflix22}, and the interdependent nature of these services means that a failure in one can cascade, disrupting the entire network. These factors make timely and accurate diagnosis of failures particularly difficult. According to~\citet{wang2018cloudranger}, identifying the root cause of issues in platforms like IBM's Bluemix can take an average of three hours without automated tools. Rapid root cause identification is therefore essential for minimizing downtime and mitigating the impact on system performance. Delays in diagnosing issues can lead to substantial financial losses and customer dissatisfaction, especially as service-level agreements often treat system availability as a key performance indicator.

Recent RCA research has focused on developing methods to detect the root cause of failures, often through a two-phase process: first constructing a graph structure, and then ranking the nodes within that graph. Some approaches rely on expert knowledge to build the graph~\citep{ma2020automap}, while others derive it from service call graphs~\citep{chakraborty2023causil} or employ deep neural networks for graph learning~\citep{lin2024root}. A common goal is to model relationships and dependencies between services, using causal discovery techniques to construct a causal graph~\citep{wang2018cloudranger, qiu2020causality, gan2021sage, ikram2022root, xin2023causalrca}. For instance, MicroCause~\citep{meng2020localizing} employs the PC algorithm to learn a causal graph from service metrics. However, the resulting graph is often identified only up to an equivalence class and therefore contains undirected edges, prompting researchers to convert it arbitrarily into a directed acyclic graph (DAG). RUN~\citep{lin2024root}, for example, removes the edge between two nodes with the lowest correlation, but this method does not guarantee that the result represents the true underlying graph. In the second phase, existing algorithms \emph{rank} all nodes using graph centrality measures or traversal strategies such as random walk~\citep{wang2018cloudranger, ma2020automap}, PageRank~\citep{wu2021microdiag, xin2023causalrca, lin2024root}, breadth-first search (BFS)~\citep{lin2018microscope}, and depth-first search (DFS)~\citep{chen2014causeinfer}. However, many rely on arbitrary objective functions that may not accurately reflect the failure propagation chain. For example, Groot~\citep{wang2021groot} assumes that sink nodes are more likely to be root causes and assigns them different scores than others.


During normal operations, site engineers or RCA systems can proactively prepare for potential failures by learning cause-effect relationships through domain knowledge or causal discovery from observational data, a topic extensively explored in the literature~\citep{spirtes1991algorithm, spirtes2000causation, chickering2002optimal, peters2013causal, zheng2018dags, lam2022greedy}. In this context, observational data refers to metrics collected before a failure occurs, while post-interventional data refers to metrics gathered after the failure. Recent work by \citet{budhathoki2022causal} leverages a causal graph from the normal operation period and allows anomalous samples from multiple distributions, but assumes known, invertible functional relations, which are hard to estimate. \citet{okati2024root} extend this line of work by relaxing the functional assumptions. Nonetheless, a fully known causal structure can be difficult to obtain in practice, especially in a large-scale system. Although \citet{okati2024root} also propose a score function for RCA that requires no causal knowledge, it remains unclear how to incorporate existing causal discovery algorithms, which often yield only a partial causal structure.
% discuss the method about TOCA and how they differ from our setup.

\textbf{Our contribution.} In this paper, we introduce a novel algorithm, \textbf{Root Cause Analysis with Causal Graphs (\name)}, which uses a system's normal operational period to proactively prepare for potential failures. To the best of our knowledge, this is the first work to show the advantages of using a \textit{partial} causal structure learned from normal operation data for RCA. We do so by showing that our proposed algorithm requires fewer and lower-order conditional independence (CI) tests than state-of-the-art RCA methods based on CI tests. We begin by examining the simplest case, where there is a single root cause and the causal relationships are fully known in the form of a DAG. Interestingly, we show that identifying the root cause in a DAG is equivalent, with minor modifications, to solving a well-established graph theory problem known as Interactive Graph Search (IGS)~\citep{tao2019interactive}. This reduction to IGS provides a novel insight: given a causal graph, a number of marginal invariance tests that is logarithmic in the number of variables is sufficient to identify the root cause. For cases with multiple root causes, we propose another algorithm to identify the root causes of failure. This algorithm exploits causal knowledge that is learned offline and can be represented as a mixed graph. For example, it can accept a CPDAG, a mixed graph with directed and undirected edges that represents a set of causal graphs containing the true graph. It can also accept partial causal structures learned by algorithms that test a smaller set of conditional independences from data, such as LOCI~\citep{wienobst2020recovering} or the recently proposed $k$PC~\citep{kocaoglu2023characterization}, which are shown to be effective in the data-scarce regime. We summarize our contributions below.

\begin{enumerate}
    % We don't propose a new algorithm. We use IGS to do RCA.
    \item Considering a single root cause and given the complete causal structure of a system, we map the problem of RCA to IGS. We further provide an algorithm that identifies the root cause with $\mathcal{O}(\log_{2}(n) + d\log_{1+d}n)$ marginal invariance tests, and show that any algorithm that relies solely on marginal invariance tests for RCA must perform $\Omega(\log_{2}(n) + d\log_{1+d}n)$ tests, where $n$ is the number of variables and $d$ is the maximum degree in the graph.
    % We don't learn the causal graph. We used a learned causal graph to do RCA when the DAG is unknown.
    \item In scenarios with multiple root causes, we propose an algorithm that leverages causal knowledge represented as a mixed graph (\eg a CPDAG) learned \textit{before} the failure. The algorithm efficiently finds the separating set based on the estimated structure and combines it with an information-theoretic approach to identify the true root causes of failure. We also prove its soundness for RCA given a partial causal structure.
    \item We validate our proposed algorithm through experiments on a real-world production-level application with a large number of variables and limited failure samples, where it achieves higher accuracy than state-of-the-art methods such as RCD~\citep{ikram2022root}, RUN~\citep{lin2024root}, and BARO~\citep{pham2024baro}.
\end{enumerate}
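
To give a sense of scale for the bound in the first contribution, consider a hypothetical system with $n = 1024$ variables and maximum degree $d = 3$. These values are assumed purely for illustration and are not drawn from any experiment, and the constants hidden by the asymptotic notation are ignored:
\[
\log_{2}(n) + d\log_{1+d}n \;=\; \log_{2}(1024) + 3\log_{4}(1024) \;=\; 10 + 3 \cdot 5 \;=\; 15.
\]
Under these assumptions, the bound is on the order of $15$ marginal invariance tests, whereas a naive strategy that tests every variable once would need $n = 1024$ tests.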
