\section{Problem Formulation}
A system has $n$ components $\mathcal{M} = \{m_{1}, \ldots , m_{n}\}$. Within a given time interval, the monitoring tool collects $d \ge 1$ metrics from each component: for every $i \in \{1, \ldots, n\}$, $\mathcal{T}(i,t) = \{r_{i,1, t}, \ldots, r_{i,d,t}\}$ is the set of $d$ metrics of component $m_{i}$ at time instant $t$. Over the entire observation period, we have two time series datasets, $\mathcal{D} = \{\mathcal{T}(1, 1), \ldots, \mathcal{T}(n, t- 1)\}$ and $\mathcal{D}^{\star} = \{\mathcal{T}(1, t), \ldots, \mathcal{T}(n, \gamma)\}$, where $t$ is the time at which the failure was first registered and $\gamma$ is the time at which the issue was fixed. Thus $\mathcal{D}$ contains the data from normal operation and $\mathcal{D}^{\star}$ the data from the failure period. We consider the setting where one can learn some cause-effect relations, in the form of a completed partially directed acyclic graph (CPDAG), from $\mathcal{D}$ at a time $s < t$. We leverage this partial causal structure to pinpoint the root cause between timestamps $t$ and $\gamma$.
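
As a purely illustrative instance (the values are assumptions chosen for readability, not taken from any experiment), let $n = 2$, $t = 4$, and $\gamma = 5$, and read the ellipses in the two definitions as ranging over all components and all time steps in the stated interval. Then
\[
\mathcal{D} = \{\mathcal{T}(1,1), \mathcal{T}(2,1), \mathcal{T}(1,2), \mathcal{T}(2,2), \mathcal{T}(1,3), \mathcal{T}(2,3)\}, \qquad
\mathcal{D}^{\star} = \{\mathcal{T}(1,4), \mathcal{T}(2,4), \mathcal{T}(1,5), \mathcal{T}(2,5)\},
\]
so $\mathcal{D}$ holds three pre-failure snapshots per component and $\mathcal{D}^{\star}$ holds two snapshots per component from the failure period.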

% by modeling the failure as an intervention to certain variables within $\V$.


% A microservice-based cloud application has a set of $n$ microservices, $\mathcal{M} = \{m_{1}, \ldots , m_{n}\}$. Within a given time interval, the monitoring tool collects at least $d$ metrics from each of the microservices, i.e. $\mathcal{T}(i,t) = \{r_{i,1, t}, \ldots, r_{i,d,t}\}$, where $d \ge 1; \forall i \in \{1, \ldots, n\}$, $\mathcal{T}(i,t)$ is a set of $d$ metrics of microservice $i$ at time instance $t$. Considering the entirety of the data, we have two time series datasets defined as $\mathcal{D} = \{\mathcal{T}(1, 1), \ldots, \mathcal{T}(n, t- 1)\}$ and $\mathcal{D}^{\star} = \{\mathcal{T}(1, t), \ldots, \mathcal{T}(n, \gamma)\}$, where $t$ represents the time when the failure was first registered and $\gamma$ is the time when the issue was fixed. We consider the setting where one can learn a $\mathcal{C}$-essential graph $\varepsilon_{\mathcal{C}}(D)=(\V,\Eb)$ at the time $s$ from $\mathcal{D}$, where $s < t$ and $\mathcal{C}$ is the set of conditioning sets used for all CI tests, $\V$ denotes the set of $d$ metrics as random variables and $\Eb$ is the set of edges where $X_{i} \rightarrow X_{j} $ represents metric $X_{i}$ causes metric $X_{j}$. We leverage this partial causal structure to pinpoint the root cause between timestamps $t$ and $\gamma$, by modeling the failure as an intervention to certain variables within $\V$. 

\textbf{Failure as Interventions.} A key observation is that a failure can be modeled as a soft intervention on the failing node~\citep{ikram2022root}.
% For example, the number of incoming requests for a microservice can change and be considered as a variable in a causal graph.
Here, the failure is represented by an F-NODE, a binary indicator variable $F$ with $F=0$ for samples from normal operation and $F=1$ for samples from anomalous operation. This representation allows one to identify the distribution invariances $P_{N}(X|Pa(X))=P_{A}(X|Pa(X))$, which hold for every variable $X$ whose mechanism is unaffected by the failure, where $P_{N}$ and $P_{A}$ are the distributions under normal and anomalous operation, respectively. By concatenating the two datasets $\mathcal{D}$ and $\mathcal{D}^{\star}$ and labeling each sample with $F$, one can sample from the distribution $P^\star$ over the observed variables $\V$ and $F$, where $P^\star(\V|F=0)=P_{N}(\V)$ and $P^\star(\V|F=1)=P_{A}(\V)$. Under this formalism, the invariance $P_{N}(X|Pa(X))=P_{A}(X|Pa(X))$ corresponds to conditional independence between $X$ and $F$ given $Pa(X)$. Since the F-NODE cannot have any incoming edges, one can then run a series of CI tests on the sampling distribution $\hat{P^\star}$ to determine which node is the root cause $R$ (the child of the F-NODE): $R$ is the node for which the dependence $(R\dep F| Pa(R))_{\hat{P^\star}}$ is observed.
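
To make this correspondence explicit, the following sketch assumes that $P_{N}$ and $P_{A}$ both factorize according to the same DAG over $\V$ and that the failure changes the mechanism of a single root cause $R$; both are simplifying assumptions made only for this illustration. Under them,
\[
P_{A}(\V) = P_{A}(R \mid Pa(R)) \prod_{X \in \V \setminus \{R\}} P_{N}(X \mid Pa(X)), \qquad
P_{A}(R \mid Pa(R)) \neq P_{N}(R \mid Pa(R)).
\]
Only the factor of $R$ differs between the two regimes, so $F$ is independent of every $X \neq R$ given $Pa(X)$, while it remains dependent on $R$ given $Pa(R)$.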

% Conducting an exponentially large number of CI tests like RCD post-failure is arguably not ideal. This is only the case because RCD operates in the absence of any causal knowledge. This can be seen by the fact that RCD only focuses on learning the adjacency of F-NODE rather than learning the whole graph as that can be time-consuming, which is undesirable for RCA. However, we note that RCA is time-sensitive only after the failure. The time before the failure allows ample time to prepare the system for failure. Therefore, we propose that one can learn the causal graph from the observational data and use it post-failure to effectively find the root cause. In the next sections, we will first discuss the benefits of having complete causal knowledge of the underlying data-generating mechanism. Then, we will present a more practical solution when the causal graph is unknown.

To demonstrate the advantages of leveraging normal operation time, we begin by examining how much the number and order of the required CI tests can be reduced given a complete causal graph (\ie a DAG). We then discuss a more practical solution for scenarios where only a partial causal structure is available during the failure period.
% Performing an exponentially large number of CI tests, as required by RCD, is far from ideal in post-failure scenarios. This is due to the fact that RCD operates without any prior causal knowledge. RCD focuses solely on identifying the adjacency of the F-NODE rather than learning the entire graph, as constructing the full causal structure can be time-consuming. However, it is important to note that RCA is time-sensitive \textit{only} after the failure occurs. The time leading up to a failure provides ample opportunity to prepare the system. Therefore, we propose leveraging this pre-failure window to learn the causal graph from observational data, which can then be used post-failure to effectively identify the root cause. In the following sections, we will first highlight the benefits of having complete causal knowledge of the underlying data-generating mechanism, followed by a more practical approach for cases where the causal graph is unknown.


% Conducting an exponentially large number of CI tests like RCD post-failure is arguably not ideal. It is preferable to implement measures during the normal operations of the system to facilitate efficient RCA. Therefore, the goal of our work is to significantly reduce the number of CI tests and hence the execution time needed to discover $R$ in $\V$ during the failure period. We accomplish this by leveraging the causal graph learned from observed data during the normal operation time of the system. We will first discuss the benefits of having complete causal knowledge of the underlying data-generating mechanism. Then, we will present a more practical solution when the causal graph is unknown.
