\section{Related Work}\label{app:relatedwork}

\textbf{Root Cause Analysis in Microservices.}
Root Cause Analysis (RCA) is performed both online~\citep{wang2023incremental} and offline~\citep{deng2021graph}, often relying on system dependency graphs~\citep{chen2014causeinfer}. Previous approaches have used statistical techniques, deep neural networks, and graph representations~\citep{brandon2020graph, capozzoli2015fault, ma2020automap}. For instance, \citet{lin2018microscope} use z-scores to compare the distributions of normal-operation and anomalous system data. The method finds the root cause by identifying the nodes that deviate the most between the two distributions, but it imposes normality assumptions on the data and is sensitive to outliers. \citet{li2022causal} use similar techniques, with a call graph provided by expert knowledge to adjust the scores. \citet{pham2024baro} improve on this idea by using the median and interquartile range instead, but their method does not leverage any causal knowledge. \citet{wang2023interdependent} used both individual and topological time series data to capture interdependencies between microservices, while \citet{xin2023causalrca} introduced a gradient-based causal structure learning method to generate weighted causal graphs and developed a root cause inference method called CausalRCA. \rev{\citet{strobl2023identifying} assume an invertible non-linear structural causal model (SCM) with additive noise terms, such as non-Gaussian error terms, and quantify root cause contributions using Shapley values based on conditional distributions on the error terms. 
\citet{strobl2024counterfactual} later extend this approach to settings where counterfactual distributions can be derived. \citet{budhathoki2022causal} assume a causal graph with fully specified functional relationships to quantify the contribution of each variable to the target outlier score. They define a root cause as the variable whose value is detected as an outlier relative to the values jointly observed for the other variables. Later, \citet{okati2024root} extend this work by providing a more efficient method when the causal graph is known and a heuristic when the causal graph is unknown, under the single root cause assumption. They also assume that only a single observation is available in the anomalous regime.} \citet{assaad2023root} use a fully oriented acyclic summary causal graph with loops, either learned from time-series data or provided by experts. They impose linearity on each directed edge of the given graph. Recently, \citet{lin2024root} proposed RUN, a method that forecasts time series by constructing a neural network for each system metric and then uses the forecasted data to build a Granger causal graph. During the diagnosis stage, RUN, like other algorithms, applies a weighted personalized PageRank algorithm to traverse the graph and identify the root cause. The work most closely related to ours is RCD~\citep{ikram2022root}, a causal framework that treats a failure as an intervention. RCD takes a hierarchical approach to causal discovery, randomly partitioning the set of observed variables and running a series of conditional independence (CI) tests in each partition to produce a set of potential root causes. This approach is particularly relevant to our work, as it also employs CI tests to localize and pinpoint the failure's root cause. 
However, despite the innovative contributions of these recent studies, we argue that a critical aspect has been overlooked: the opportunity to utilize normal operation periods to develop a more efficient and effective RCA method for failure periods.
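
As a rough illustration of the contrast between the z-score and median/interquartile-range scoring schemes mentioned above, one way to write the two statistics is given below. The notation is ours and is only a sketch, not the exact formulation of the cited methods: $x_1, \dots, x_T$ denote the values of one metric in the anomalous period, and $\mu$, $\sigma$, $\mathrm{med}$, and $\mathrm{IQR}$ denote the mean, standard deviation, median, and interquartile range of the same metric during normal operation.
\[
s_{\mathrm{z}} = \max_{t} \frac{|x_t - \mu|}{\sigma},
\qquad
s_{\mathrm{r}} = \max_{t} \frac{|x_t - \mathrm{med}|}{\mathrm{IQR}} .
\]
A few extreme values in the normal-operation data can inflate $\sigma$ and thereby shrink $s_{\mathrm{z}}$, whereas $\mathrm{med}$ and $\mathrm{IQR}$ are largely unaffected by them.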

\rev{\textbf{Causal Discovery with Background Knowledge for RCA.} One common approach to learning the causal structure is to incorporate expert knowledge~\citep{chakraborty2023causil, gong2024porca, lin2024root, xin2023causalrca}. However, it may not always be feasible to obtain expert knowledge. A data-driven approach to causal structure learning then becomes a more viable solution. A key point is that learning causal structures does not require interventional data~\citep{spirtes2000causation, chickering2002optimal, shimizu2006linear, zheng2018dags}. We can leverage the vast amounts of observed data generated during the system's normal operation to construct the causal graph, rather than waiting for a failure.} 

\rev{
\textbf{Causal Discovery with Observational Data.} A wide class of causal discovery algorithms that learn a completed partially directed acyclic graph (CPDAG) are either score-based or constraint-based~\citep{chickering2002optimal, spirtes2000causation}. For score-based methods, learning a causal structure can be extremely time-consuming~\citep{chickering2004large}. Fortunately, recent advances speed up the construction of an essential graph in score-based methods~\citep{chickering2020statistically, ramsey2017million, nazaret2021extremely}. \citet{lam2022greedy} also provide an efficient algorithm named GRaSP, which exploits permutation-based reasoning to search for a causal graph that is guaranteed to be in the Markov equivalence class of the ground truth. \citet{andrews2023fast} extend GRaSP with Grow-Shrink Trees to make the algorithm more accurate and scalable. With further assumptions, \citet{montagna2023scalable} also provide a scalable causal discovery algorithm based on score matching to recover a causal graph.

Given that the use of CI tests is a central aspect of our work, we also provide a brief overview of recent advances in causal discovery, particularly those focused on using CI tests. Causal discovery often relies on a series of CI tests to determine relationships between variables. However, this approach can be problematic, as the statistical power of CI tests diminishes with finite samples or when the conditioning set is large~\citep{shah2020hardness}. In addition, constraint-based algorithms often condition on large sets of nodes to identify possible separating sets for each node~\citep{spirtes2000causation}. This time-consuming aspect of causal discovery is particularly undesirable in our context, where time is critical following a failure and the goal is to quickly pinpoint the root cause. A promising direction for addressing this issue is to restrict the size of the conditioning set. In the absence of latent confounders, \citet{wienobst2020recovering} introduced a sound and complete algorithm known as Low-Order Causal Inference (LOCI), which learns a graphical representation based on CI relations of order $k$ or lower. Similarly, \citet{kocaoglu2023characterization} provided a novel characterization of the graphical representation, termed the $k$-essential graph, along with a sound learning algorithm to construct it. Building on these ideas, \citet{lee2024constraint} proposed an approach that further restricts the conditioning sets for all CI tests, so long as these tests include all marginal tests. Our work integrates these recent advancements to develop and utilize a more robust causal graph than the current state of the art in the RCA literature. 
}
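
To make the notion of a bounded conditioning set concrete, a CI statement of order at most $k$ can be written as follows. The notation ($V$ for the set of observed variables and $n = |V|$) is ours and is meant only as an illustrative sketch, not as the formulation used in any of the cited papers.
\[
X \perp Y \mid Z, \qquad X, Y \in V,\quad Z \subseteq V \setminus \{X, Y\},\quad |Z| \le k .
\]
Marginal tests correspond to $k = 0$, that is, $Z = \emptyset$. For a fixed pair $(X, Y)$, the number of candidate conditioning sets of size at most $k$ is $\sum_{i=0}^{k} {n-2 \choose i}$, which grows polynomially in $n$ for fixed $k$, whereas allowing conditioning sets of any size gives $2^{n-2}$ candidates.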



