 \documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
%                                     version; also before submission to
%                                     see how the non-anonymous paper
%                                     would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

% Some suggested packages, as needed:
% \usepackage{natbib} % has a nice set of citation styles and commands
% \bibliographystyle{plainnat}
% \bibliographystyle{plain}
% \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{cleveref}

\usepackage{balance}

% ------------Ours starts---------------
\usepackage{xr-hyper}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{subfigure}


\usepackage{float}



\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\newcommand{\indep}{\perp \!\!\! \perp}
\newcommand{\notindep}{\not \perp \!\!\! \perp}
\newlength\myindent
\setlength\myindent{2em}
\newcommand\bindent{%
  \begingroup
  \setlength{\itemindent}{\myindent}
  \addtolength{\algorithmicindent}{\myindent}
}
\newcommand\eindent{\endgroup}

\newcommand{\rcforest}[1]{$\mathbf{#1}$-rooted C forest }
\newcommand{\rcforests}[1]{$\mathbf{#1}$-rooted C forests }

\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}

\newcommand{\Y}{\mathbf{Y}}
\newcommand{\y}{\mathbf{y}}

\newcommand{\Z}{\mathbf{Z}}
\newcommand{\z}{\mathbf{z}}


\newcommand{\R}{\mathbf{R}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\newcommand{\GbarX}{G_{\overline{\mathbf{X}}}}
\newcommand{\red}[1]{{\color{red}#1}}
\newcommand{\blue}[1]{{\color{blue}#1}}
\newcommand{\DO}{\text{do}}
\newcommand{\mr}{\textcolor{orange}}


\usepackage{algorithm2e}
% \usepackage{algorithmic}
\usepackage[noend]{algorithmic}

% \DeclareMathOperator*{\argmin}{argmin}


%-------------Ours ends---------------



%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\title{Finding Invariant Predictors Efficiently via Causal Structure}


% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{\href{mailto:<lee4094@purdue.edu>?Subject=Questions about ID4IP paper}{Kenneth Lee}{}}
\author{Md Musfiqur Rahman}
\author{Murat Kocaoglu}
% Add affiliations after the authors
\affil{%
    School of Electrical and Computer Engineering\\
    Purdue University\\
    West Lafayette, Indiana, USA
}
%
%
\begin{document}

\maketitle
\begin{abstract}
One fundamental problem in machine learning is out-of-distribution generalization. A method named the surgery estimator incorporates the causal structure in the form of a directed acyclic graph (DAG) to find predictors that are invariant across target domains using distributional invariances via Pearl's do-calculus. However, finding a surgery estimator can take exponential time as the current methods need to search through all possible predictors. In this work, we first provide a graphical characterization of the identifiability of conditional causal queries. Next, we leverage this characterization together with a greedy search step to develop a polynomial-time algorithm for finding invariant predictors using the causal graph. Given the correct causal graph, our method is guaranteed to find at least one invariant predictor, if it exists. We show that our proposed algorithm can significantly reduce the run-time in both simulated and semi-synthetic data experiments and has predictive performance that is comparable to the existing work that runs in exponential time.
\end{abstract}
%
\section{Introduction}
\label{introduction}
% \mr{What is the problem of invariant prediction and why do we care about it?}
One fundamental challenge in machine learning (ML) is to deploy an algorithm that generalizes well to unseen data. When the training data distribution and the target distribution differ, i.e., a distribution shift occurs, ML algorithms can make mistakes that have serious consequences in mission-critical applications in areas such as healthcare \cite{schulam2017reliable, zech2018variable}. Thus, an important goal in ML is to carefully select the features that can be used to train predictive algorithms  that perform well in new environments.
%%% I realized this will be too slow. Let's jump into domain adaptation and shift directly instead. keeping it for your reference.
%Generalization problem has traditionally been addressed in several different ways. Motivated by Occam's razor, one common way is to use regularization in the objective function to encourage sparse and simpler solutions. VC dimension is used to give generalization guarantees in predictive tasks given a function class used to perform the prediction task.  
%%%%
\par
There have been numerous studies to investigate distribution shifts using different tools.
% from different directions. 
%
\cite{duchi2022distributionally, slowik2022distributionally} evaluate predictor performance under mixture covariate shifts by modeling the task as a distributionally robust optimization (DRO) problem~(\cite{ben2009robust}).
In this approach, they consider a lower bound on the proportion of the minority sub-population in a mixture model and minimize the worst-case subpopulation loss.
%---------
% \citet{duchi2022distributionally, slowik2022distributionally} evaluates their predictor performance under  mixture covariate shifts by modeling it as a distributionally robust optimization (DRO) problem. This connection appears to be advantageous since DRO performs robustly across subpopulations and environments~(\cite{ben2009robust}).
% In this approach, they consider a lower bound of the proportion of the minority sub-population from a mixture model and minimize their worst-case subpopulation loss.
% ------
% \citet{duchi2022distributionally, slowik2022distributionally} have modeled the distribution shift problem as a distributionally robust optimization (DRO) problem. \blue{This connection appears to be advantageous} since DRO performs robustly across subpopulations and environments~(\cite{rahimian2019distributionally}).
% ben2009robust
% In their work, they propose procedures that control performance over all large enough \blue{sub-populations, irrespective of the distribution of each subpopulation}.
% One of the recent methods that build predictors enabling out-of-distribution (OOD) generalization is invariant risk minimization~\cite{arjovsky2019invariant}. They estimate nonlinear, invariant, causal predictors from multiple training environments to perform well in new unseen test environments.
Another line of work deals with the distribution shift~(\cite{quinonero2008dataset, scholkopf2012causal})
% storkey2009training
% where they
by developing learning models that are stable against shifts due to changes in the data-generating mechanisms.
% \blue{Kenneth: \cite{quinonero2008dataset, scholkopf2012causal} develop learning models that are stable against shifts in how the data is generated.}
Researchers have considered the causal connection between features ($\mathbf{X}$) and the target variable ($Y$) to introduce methods to deal with different types of distribution shifts. Some examples include covariate shift where $P(\mathbf{X})$ changes~(\cite{gretton2009covariate,rezaei2021robust}), target shift where $P(Y)$ changes~(\cite{zhang2013domain, gong2018causal}) and conditional shift where $P(Y|\mathbf{X})$ changes~(\cite{gong2016domain}).
\par
% There are mainly two types of stable training algorithms in the literature that deal with different forms of distribution shifts: reactive and proactive. 
There are mainly two types of stable training algorithms in the literature that consider different forms of distribution shifts: reactive and proactive~(\cite{candela2009dataset}).
Reactive approaches consider datasets from the deployment environment to train and adjust their training algorithm accordingly by re-weighting the training data so that it performs better in the deployment environment~\cite{storkey2009training, gretton2009covariate}.
However, in many sensitive applications, we do not have access to every possible domain dataset. Under these circumstances, proactive approaches are preferable, as they are trained without any deployment data and prepared to perform well for any possible distribution shift~\cite{subbaswamy2018counterfactual, saria2019tutorial}.
% \blue{Kenneth: In contrast, proactive approaches do not use any deployment data \cite{subbaswamy2018counterfactual, saria2019tutorial}}
% where they can perform well
% These approaches anticipate and address the differences between training and deployment environments without using deployment data. 
% \blue{KENNETH: Remove this whole sentence: "These approaches anticipate and address the differences between training and deployment environments without using deployment data." }
Proactive algorithms train with stable information/features that are invariant across environments. 
% \blue{KENETH: They work by training algorithms with predictors that are invariant across environments.}
\par
Some recent proactive algorithms such as
\cite{rojas2018invariant, magliacane2018domain} find optimal conditioning sets containing causal, anti-causal, or confounded dependence (i.e., unobserved common cause) with the target by
hypothesis testing their stability across multiple data domains.
This makes the prediction invariant to the specific distribution shift.
% observed in training data 
Such algorithms model the causal relations among the features as a causal graph and model distribution shift via an auxiliary node that captures the shift as an intervention. This setup allows utilizing conditional independence relations to obtain stable predictors.
% \blue{KENNETH: just say "It is because these algorithms.." instead of "To achieve this advantage". }
\begin{figure}[t!]
  \centering
  \includegraphics[width=0.6\linewidth]{Figures/algorithm-details/motivating-example.pdf}
  \caption{A causal graph representing causal relations among features.
  We use $A,I,X,H$
  from a hypothetical $\DO(H)$ interventional distribution
  to predict \emph{Pneumonia} severity such that it stays invariant to distribution shift.
  % as $P(N|\DO(H),A,I,X)$ where $\DO(H)$ 
  % distribution shift 
  % for \emph{Pneumonia} severity prediction.
  }
  \label{fig:motivating-example}
\end{figure}
%
\par
% https://dl.acm.org/doi/pdf/10.1145/2783258.2788613
% A unifying causal framework for analyzing dataset shift-stable learning algorithms
% Causal machine learning for healthcare and precision medicine
% To understand the distribution shift problem better, 
Suppose we have access to the data-generating causal graph in Figure~\ref{fig:motivating-example}. 
% \blue{KENNETH: get rid of "To understand the distribution shift problem better" and just starts with "Suppose we have access..."}
% from the given training dataset.
 The admission criteria for \emph{ICU}($I$) is caused by \emph{Asthma (A)} having a confounding effect on \emph{Pneumonia (N)} severity and \emph{Hospital equipment} ($H$). The variable \emph{Xray (X)} is caused by \emph{Pneumonia} and the \emph{Hospital equipment}. 
We assume the feature $H$ is responsible for the distribution shifts and represent the shift with a discrete variable $S$ pointing to it. 
For example, $S=0$ could represent the training distribution, and $S=1$ could represent the distribution during deployment.
% To train a model that can predict $N$, we use variables $A$, $X$, and $H$ as features. 
According to most existing algorithms in the literature~(\cite{gretton2009covariate, zhang2013domain}), we can use features $A$, $X$, and $H$ as valid stable predictors to train a model that predicts $N$, since they cut off any dependence on $S$.
% to train a model that can predict $N$, we can use features $A$, $X$ and $H$ as valid stable predictors since they cut off any dependence from $S$ \blue{KENNETH: we can use features $A$, $X$ and $H$ to train a model that can predict $N$ as valid stable predictors since they cut off any dependence from $S$}.
Along with these features, $I$ might be a better predictor of $N$.
% contain important information 
% about $N$ and we might wish to utilize it for prediction.
Although $N$ and $H$ are independent for the mentioned predictors, they become dependent once we control for $I$ in the dataset, and consequently, $N$ becomes dependent on $S$. 
% \blue{KENNETH: Although $N$ and $H$ are independent, they become dependent once we control for $I$ in the dataset samples. Eventually, $N$ becomes dependent with $S$.} 
Therefore, if we wish to include $I$ in the stable predictors, previous approaches cannot suggest any solution to achieve that without creating dependence between $S$ and the target variable, i.e., any such predictor becomes domain-dependent. 
\par
Recently, \cite{subbaswamy2019preventing} proposed an algorithm that removes the dependence on any mechanism that is sensitive to the distribution shift using hypothetical interventions to find stable predictors known as \emph{graph surgery estimators}. 
Such interventions are not actually performed but simulated from the observational data via the identification algorithm~\cite{shpitser2008complete}. 
% to calculate the
 % causal query corresponding to the hypothetical interventions on the causal graph of the observational training data set.
In Figure~\ref{fig:motivating-example}, the query $P(N|\DO(H), A, I, X )$ can be uniquely calculated from training data and $\DO(H)$ d-separates $N$ from $S$. Thus, we can train our model with predictors $I, X, H, A$ from a hypothetical $\DO(H)$ interventional distribution to predict $N$. However, in order to find such predictors, \cite{subbaswamy2019preventing} iterates over the subsets of all variables and constructs an exponential number of conditional causal queries. Each such query requires one execution of the \textbf{ID} algorithm. 
This results in an exponential-time algorithm in the worst case.
% Thus this approach will become inefficient and may take exponential running time in the worst-case scenario.
% \blue{KENNETH: However, \citet{subbaswamy2019preventing} propose to iterate over many subsets of all variables to find such predictors, resulting in an exponential-time algorithm in the worst case scenario.}
% Even if there is no invariant predictor, in the worst-case scenario, they require exponential time to find such an outcome.
\par
% \cite{shpitser2008dormant} proposed a solution to a similar problem: Identifying the conditional independence statements in interventional distributions, which can be inferred using only observational data. %approach that is quite similar to the causal invariant prediction problem. 
 \cite{shpitser2008dormant} proposed a solution to a similar problem in a different context, identifying the conditional independence statements in interventional distributions using only observational data.
Such conditional independences are called \emph{dormant independences}. They provide a complete algorithm for finding %detecting conditional independencies that hold in interventional distributions, namely 
dormant independence between two sets of variables. 
Although very relevant to the surgery estimator problem, their approach cannot be directly applied to solve the causal invariant prediction problem. 
% \blue{KENNETH: Although it is relevant to the surgery estimator problem, the study does not focus on finding invariant predictors.} %Their approach considers 
% They only consider variables that are connected through directed paths, but other variables with anti-causal and confounding paths can be useful in our invariant prediction task as well. 
% We establish a formal connection  %the dot 
% between their work and the invariant prediction problem and  propose a generalized solution for any causal graph by providing several invariant predictors starting from dormant independences.
We establish a formal connection between their work and the invariant prediction problem and propose a generalized solution to the latter. 
% and  propose a generalized solution 
We provide several invariant predictors for any causal graph by starting from dormant independence.
\par
% In this paper, by leveraging a characterization of causal identifiability of conditional queries, and systematically combining the ideas from dormant independence with %we propose 
% a greedy feature selection step, we propose a polynomial time algorithm that outputs invariant predictors given the causal graph.  
In this paper, we propose a polynomial-time algorithm that outputs invariant predictors given the causal graph by leveraging a characterization of causal identifiability of conditional queries and systematically combining the ideas from dormant independence with a greedy feature selection step. 
Our algorithm is guaranteed to find at least one invariant predictor if it exists. %in polynomial time. 
We perform extensive experiments and the results illustrate that our algorithm gains significant computational efficiency compared to the  existing work and has competitive predictive performance.
Our contributions are summarized as follows:
\begin{enumerate}
    \item  We provide a graphical characterization of the identifiability of conditional causal queries and leverage it with greedy search to develop a sound algorithm called \emph{ID4IP} for finding invariant estimators in polynomial time
    given the causal graph structure.
    % when observed data is modeled as a DAG.
    \item We show that  ID4IP is sound.
    % and that it will output
    We also show that ID4IP outputs at least one graph surgery estimator whenever such an estimator exists.
    % \red{KENNETH: this needs attention. We only output one graph surgery estimator out of all graph surgery estimators we have found.}.
    \item We perform  experiments on 
    both synthetic and semi-synthetic data to illustrate that our algorithm has predictive performance that is comparable to a complete algorithm in the literature by~\cite{subbaswamy2019preventing}, and outperforms it when the runtime is limited.
\end{enumerate}
\section{Background}
\label{preliminaries}
In this section, we describe the necessary definitions and background knowledge required to introduce our approach.
\begin{definition}[Structural Causal Model (SCM) and Causal Graph]
An SCM is a tuple $\mathcal{M}= (\mathbf{V}, \mathbf{E}, \mathcal{N}, \mathcal{U}, \mathcal{F}, \mathcal{P}(.) )$ that contains a set of observable variables $\mathbf{V}$,
a set of unobserved exogenous variables $\mathcal{N}$, a set of latent confounders $\mathcal{U}$, i.e., unobserved common causes of two observable variables, a set of functions $\mathcal{F}$, and a product probability distribution $\mathcal{P}(.)$ over $\mathcal{N}$ and $\mathcal{U}$. 
Each observed variable is generated as $V_i=f_i(Pa_i, E_i, U_{S_i})$, where $f_i\in \mathcal{F}$, $Pa_i\subset \mathbf{V}$, $E_i \in \mathcal{N}$, and $U_{S_i}\coloneqq \{U_j:j\in S_i\}\subset \mathcal{U}$ for some index set $S_i$. 
The variable set $\mathbf{V}$ has a joint distribution $\mathcal{P}_{\mathbf{V}}$ implied by $\mathcal{F}$ and $\mathcal{P}(.)$.
%
% \mr{
% $\mathcal{F}$=$\{f_{V_1}, f_{V_2},..,f_{V_n}\}$ is the set of functions that generate each observed variable from other observed variables as $V_i=f_i(Pa_i, E_i, U_{S_i})$, where $Pa_i\subset \mathcal{V}$, and $U_{S_i}\coloneqq \{U_j:j\in S_i\}$ for some $S_i\subset \mathcal{U}$. 
% % such that $f_{V_i}$ is a functional mapping from  
% $\mathcal{P}(.)$ is a product probability distribution over the exogenous $\mathcal{N}$ and latent variables $\mathcal{U}$.
% The set of observable variables $\mathcal{V}$ has a joint distribution, 
% $\mathcal{P}_{\mathcal{V}}$ implied by $\mathcal{P}(.)$ and $\mathcal{F}$.}
%
%
% The causal functions of the 
\par
An SCM induces
% can be summarized into 
a directed acyclic graph called a causal graph, $G=(\mathbf{V}, \mathbf{E})$. Here $\mathbf{V}$ is the set of observable nodes and $\mathbf{E}$ is the set of directed edges. For any pair $V_i, V_j$, a directed edge $V_i \rightarrow V_j \in \mathbf{E}$ indicates that $V_i$ is a parent of $V_j$, i.e., $V_i\in Pa(V_j)$, and $V_j$ is a child of $V_i$, i.e., $V_j \in Ch(V_i)$; this edge is present if and only if $V_i$ is an argument of $f_{V_j}$. There exists a bi-directed edge $V_i \leftrightarrow V_j \in \mathbf{E}$ in $G$ if $V_i$ and $V_j$ share a latent confounder. $An(V)$ and $De(V)$ represent the ancestors and descendants of $V$, respectively. 
% Any node $V_i, V_j, V_k$ are neighbors of $Nbr(V)$ 
$Nbr(V)$ represents the nodes that are parents or children of $V$ or that share a bi-directed edge with $V$. For a variable set $\mathbf{V}$, $Pa(\mathbf{V})= \left(\bigcup_{V_i\in \mathbf{V}} Pa(V_i)\right) \setminus \mathbf{V}$, and $Ch(\mathbf{V})$ is defined analogously. In contrast, $An(\mathbf{V})$, $De(\mathbf{V})$, and $Nbr(\mathbf{V})$ include the set $\mathbf{V}$ itself. We let $G_{S}$ denote the subgraph of $G$ induced by any subset $S$ of the nodes $\mathbf{V}$, 
$G_{\overline{S}}$ be the graph obtained by removing the incoming edges to $S$ from $G_{S}$,
% $G_{\overline{S}}$ be $G_{S}$ with all incoming edges of $S$ removed,
and $G_{\underline{S}}$ be $G_{S}$ with all outgoing edges of $S$ removed.
We define an intervention $\DO(x)$ as the operation that replaces $f_{X}$ with the constant assignment $X=x$ and substitutes $x$ for $X$ in every other function where $X$ occurs. We represent the observed distribution after such an intervention as $\mathcal{P}_{x}(\mathbf{V})$ and the corresponding causal graph as $G_{\overline{X}}$. Let $\langle X, Y, Z \rangle$ be any consecutive triple along a path $p$. $Y$ is a \textit{collider} on $p$ if both edges are into $Y$. Otherwise, $Y$ is a \textit{non-collider} on $p$. 



\end{definition}
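\par
As a minimal illustration of the definition above, consider a hypothetical three-variable SCM (chosen only for exposition; it is not one of the models studied in this paper) with a single latent confounder $U_1$:
\begin{align*}
V_1 &= f_1(E_1, U_1), \\
V_2 &= f_2(V_1, E_2), \\
V_3 &= f_3(V_2, E_3, U_1).
\end{align*}
The induced causal graph has the directed edges $V_1 \rightarrow V_2 \rightarrow V_3$ and the bi-directed edge $V_1 \leftrightarrow V_3$, since $V_1$ and $V_3$ share $U_1$. The intervention $\DO(v_2)$ replaces $f_2$ with the constant assignment $V_2 = v_2$, so the post-interventional graph $G_{\overline{V_2}}$ drops the edge $V_1 \rightarrow V_2$ and keeps every other edge.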
\begin{definition}[d-separation]
In a DAG, a path $p$
between vertices $X$ and $Y$ is \textit{d-connecting (active)}
relative to a set of vertices $\mathbf{Z}$ ($X, Y \notin \mathbf{Z}$) if
$(i)$ every non-collider on $p$ is not in $\mathbf{Z}$ and $(ii)$ every collider on $p$ is an ancestor of some $Z \in \mathbf{Z}$.
% \begin{itemize}
%     % \item Every non-collider on $p$ is not a member of $Z$.
%     % \item \red{Some} colliders on $p$ is an ancestor of some member of $Z$.
%     % \item  \blue{Every node $S$ following the pattern: $R\rightarrow S \leftarrow T, R \leftrightarrow S \leftarrow T,$ or $R \leftrightarrow S \leftrightarrow T$ (called colliders) on $p$ is an ancestor of some member of $Z$.}
%     \item Every non-collider on $p$ is not in $\mathbf{Z}$  and
%     \item every collider on $p$ is an ancestor of some $Z \in \mathbf{Z}$.
% \end{itemize}
If there is no d-connecting path between $X$ and $Y$ relative to $\mathbf{Z}$, we say $X$ and $Y$ are \textit{d-separated} relative to $\mathbf{Z}$, denoted as $(X \indep Y|\mathbf{Z})_{G}$.   
\end{definition}
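\par
To make the two conditions concrete, consider a hypothetical DAG (chosen only for illustration) with edges $A \rightarrow B \rightarrow C$ and $A \rightarrow D \leftarrow C$. There are two paths between $A$ and $C$: the chain $A \rightarrow B \rightarrow C$, on which $B$ is a non-collider, and $A \rightarrow D \leftarrow C$, on which $D$ is a collider. Hence
\begin{align*}
&(A \notindep C|\emptyset)_{G}, \quad (A \indep C|B)_{G}, \\
&(A \notindep C|B, D)_{G}.
\end{align*}
Relative to the empty set, the chain is active. Conditioning on $B$ blocks the chain, and the second path stays blocked because the collider $D$ is not an ancestor of $B$. Conditioning additionally on $D$ activates the collider path.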
%
\begin{definition}[Causal Effect Identifiability~\cite{shpitser2012identification}]
\label{def:identifiability}
Let $\mathbf{X}, \Y, \Z$ be disjoint sets. The causal effect of an action $\DO(\mathbf{x})$ on a set of variables $\mathbf{Y}$ in a given context $\mathbf{z}$ is said to be identifiable from $P$ in $G$ if $P_{\mathbf{x}}(\mathbf{y}|\mathbf{z})$ is (uniquely) computable from
$P$ in any causal model that induces the causal graph.
% $G$.
%fixed for x but no more upper case. The definition is taken from the cited article.
\end{definition}
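\par
As a small worked example under assumed graphs that are not taken from this paper, suppose $G$ contains the edges $Z \rightarrow X$, $Z \rightarrow Y$, and $X \rightarrow Y$ with no latent confounders. Then $Z$ satisfies the back-door criterion for $(X, Y)$, and the truncated factorization $P_{x}(y, z) = P(z)P(y|x,z)$ gives
\begin{align*}
P_{x}(y) &= \sum_{z} P(y|x,z)P(z), \\
P_{x}(y|z) &= \frac{P_{x}(y,z)}{P_{x}(z)} = P(y|x,z),
\end{align*}
so both the marginal and the conditional effect are identifiable. In contrast, in the graph with $X \rightarrow Y$ and $X \leftrightarrow Y$, $P_{x}(y)$ is not identifiable: two causal models can agree on $P$ and still disagree on $P_{x}(y)$.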
%
%
%
% Check : Nonlinear Invariant Risk Minimization: A Causal Approach- if you want definition of invariant predictors.
\emph{\textbf{Distribution shifts:}}
% The problem of shifting 
Distribution shifts refer to the changes
between training conditions and deployment conditions that prevent the generalization of machine learning and statistical models.
% can be referred to as distribution shift. 
% \blue{KENNETH: Distribution shifts refer to the changes between the probability distributions of the training and testing data.} 
For a set of features $\mathbf{X}$ and a target variable $Y$, distribution shift can be categorized into sub-groups~(\cite{zhang2013domain}) based on assumptions about the training domain and test domain. For example:
% For a set of features $\mathbf{X}$ and target $Y$, the distribution shift can be categorized into sub-groups~(\cite{zhang2013domain}):
$i)$ $P(Y)$ changes while $P(Y|\mathbf{X})$ stays fixed (target shift), $ii)$ $P(Y|\mathbf{X})$ changes while $P(Y)$ stays fixed (conditional shift), and $iii)$ only $P(\mathbf{X})$ changes (covariate shift).
 To prevent failure driven by distribution shifts, 
 we observe the relationships among dataset variables 
 % we should observe how each variable in the dataset is related to 
 % \red{related to not connected to}
 % each other 
 and how the dataset is generated. 
 % A causal graph can be used for this purpose representing the underlying data-generating process. 
% To prevent a distribution shift we can identify the sources where the shift occurs in the causal graph.
% \red{Please write this in a complete sentence: 
% We can identify the sources where the shift occurs 
% in the causal graph to prevent distribution shifts. 
One way is to model the underlying data-generating process as a causal graph and identify the sources where the shift occurs in the graph.
% \blue{
% KENNETH: replace "to protect against...distribution shifts" with:
% We should understand the relationships among the variables in the data to prevent failure driven by distribution shifts. One way is to model the underlying data-generating process as a causal graph and identify the sources where the shift occurs in the causal graph. }
% which models the underlying data-generating process.
% The benefit of using causal graphs is that such graphical representation encodes the conditional independence relations among the observed variables.
\par
For example, in Figure~\ref{fig:motivating-example}, 
% we assume the feature $H$ is responsible for the shift between datasets. 
% \blue{KENNETH: 
assume that we know that the distribution of $H$ changes between the training and testing data.
% }
We address this in the causal graph by adding an auxiliary variable, called a selection variable $S$~(\cite{pearl2011transportability}), pointing to $H$.
Suppose we want to predict $N$ from $A$, $X$, and $I$. 
% \blue{KENNETH: Suppose we want, remove "would like"} 
The Bayes-optimal predictor models the conditional distribution $P(N|A, I,X)$. Because $X$ is a child of $H$, there exists a d-connecting path from $S$ to $N$ relative to the set $\{A, I, X\}$, so 
$N$ is conditionally dependent on $S$ given $A, I, X$. 
As a result, this conditional distribution changes in the target environment, making the model prone to distribution shift.
Therefore, a systematic approach to ensure invariant prediction is to intervene on $H$ and use conditioning sets that render the target $N$ independent of $S$ while producing an identifiable conditional query. This method is known as graph surgery~(\cite{subbaswamy2019preventing}). 
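\par
Spelled out with the edges described for Figure~\ref{fig:motivating-example} ($S$ points to $H$, and $X$ is a child of both $H$ and $N$), one d-connecting path is
\[
S \rightarrow H \rightarrow X \leftarrow N.
\]
On this path, $H$ is a non-collider outside $\{A, I, X\}$ and $X$ is a collider inside it, so both conditions for an active path hold. In $G_{\overline{H}}$, the edge $S \rightarrow H$ is removed. Assuming $S$ is adjacent only to $H$, it is then disconnected from $N$, which gives $(N \indep S|A, I, X)_{G_{\overline{H}}}$.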
% For that purpose, one approach would be to intervene on the \emph{Hospital equipment} variable and use that query for training purposes if the query is identifiable. This method is known as graph surgery~(\cite{subbaswamy2019preventing}). 

% Therefore, if we wish to train a stable model that stays invariant to the distribution shift from the features of the causal graph, we have to ensure that the target variable $N$ is d-separated from the selection variable $S$.
% \par
% Now, suppose, we wish to utilize the features: $A$, $X$, and $I$ to train a model that can predict $N$. For that purpose, if we condition on them,
% \blue{a d-connecting path, namely $\langle H,I,N\rangle$, is formed between $N$ and $S$}
% there appears a d-connecting path from the selection variable to the target 
% which makes the model prone to distribution shift.
% To resolve this issue, one approach would be to intervene on the \emph{Hospital equipment} variable and use that query for training purposes if the query is identifiable. This method is known as graph surgery~(\cite{subbaswamy2019preventing}). 
% %
% %
\begin{figure}[t!]
  \centering
\includegraphics[width=0.6\linewidth]{Figures/algorithm-details/graph_surgery_upd.pdf}
  \caption{Selection diagram: Causal Graph with $S$}\label{fig:graph-surgery-example}
\end{figure}
%
\begin{definition}[Graph surgery estimator]
\label{def:graph-surgery-estimator}
    % \mathbf{(Graph surgery estimator)} 
    Let $S$ be the set of shift variables, and $Y$ be the target variable. For any subsets $\mathbf{Q}, \mathbf{W}\subseteq \mathbf{V}$, if $(Y \indep S|\mathbf{W})_{G_{\overline{\mathbf{Q}}}}$ and $P(Y|\DO(\mathbf{Q}), \mathbf{W})$ is identifiable in $G$ and $P(Y|\DO(\mathbf{Q}), \mathbf{W})\ne P(Y)$, then $P(Y|\DO(\mathbf{Q}), \mathbf{W})$ is called a \textit{graph surgery estimator}.
\end{definition}
We will illustrate the concept of a graph surgery estimator and other graphical definitions using the graph in Figure \ref{fig:graph-surgery-example}. 
% as a running example \red{as an example}. 
In this example, $P(Y| \DO(X), H, T)$ is a graph surgery estimator because $P(Y|\DO(X),H, T)$ is identifiable and $(Y\indep S|T,H)_{G_{\overline{X}}}$. 
Note that $P(Y| \DO(X),H,T) \ne P(Y|X, H,T) \ne P(Y)$, i.e., the intervention $\DO(X)$ has a non-zero effect on the distribution of $Y$.
The purpose of a graph surgery estimator is to find a predictor that is invariant across environments by shielding off the causal effects of $S$ on $Y$. It uses interventional queries that can be computed from the observational distribution as invariant predictors; such predictors are not available 
by only checking d-separation.
% by only using d-separation.
In Figure \ref{fig:graph-surgery-example}, 
suppose we use $H$ for predicting $Y$. Thus the target variable is distributed as $P(Y|H)$.
% if we use $H$ for prediction assuming it is an important feature, we would need to condition on it. 
%
After conditioning on $H$, we search for an additional feature set $K$ that keeps $Y$ d-separated from $S$. However, no such $K$ yields an invariant predictor
% After conditioning on $H$, there does not exist any set $K \subset \mathbf{V}$ such that we can acquire an invariant predictor where $K$ d-separates $S$ from $Y$ 
unless we utilize a graph surgery estimator. 
% \blue{KENNETH: I think we should stick with what prof wrote to start with "Suppose we search for.."}
Similarly, in Figure~\ref{fig:motivating-example}, $P(N|\DO(H),A, I,X)$ is a graph surgery estimator since the query is identifiable and $(N\indep S|A, I,X)_{G_{\overline{H}}}$. 
% \red{KENNETH: I think we should ask Prof whether we should keep the text in figure 1 since you have already mentioned what each variable represents in the paragraph.}
% \par
One crucial condition of the graph surgery estimator is that the interventional distributions have to be identifiable from observational training data. 
Next, we provide several definitions that are used in identifiability, which will also be useful for our algorithm. 
% Thus,
% we introduce the following concepts required for identifiability.
\begin{definition}[C-component]
\label{def:ccomponent}
    % \mathbf{(C-component)} 
    A graph $G$ in which every pair of observable nodes is connected by a bidirected path is called a C-component (confounded component).
\end{definition}
\begin{definition}[C-tree]
\label{def:ctree}
Let $G$ be a C-component such that each vertex of $G$ has at most one child, and only one vertex $Y$ (called the root) has no children. Then $G$ is called a $Y$-rooted C-tree.
\end{definition}

\begin{definition}[C-forest]
\label{def:cforest}
Let $G$ be a C-component such that each vertex of $G$ has at most one child, and let $\mathbf{Y}$ be the non-empty set of vertices of $G$ that have no children. Then $G$ is called a $\mathbf{Y}$-rooted C-forest. 
\end{definition}
Note that every C-tree is also a C-forest, but the converse is not true. In Figure \ref{fig:graph-surgery-example}, $G_{T,Y, Q,R,Z}$ is a $\{Y,Z\}$-rooted C-forest, but it is not a C-tree because it has two roots. Additionally, $G_{Q, R, Z}$ is both a $Z$-rooted C-tree and a C-forest, as $Z$ is its only node with no children and every other node has exactly one child.
However, $G_{Z, R, W}$ fits neither definition, as $W$ does not belong to the same C-component as $\{R, Z\}$.
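\par
The definitions of C-component, C-tree, and C-forest can be checked mechanically. The following Python snippet is a minimal sketch rather than part of our implementation: the helper names \texttt{is\_c\_component} and \texttt{c\_forest\_roots} are hypothetical, and the edge lists are an assumed toy graph chosen only so that the three cases discussed above can be reproduced; they are not claimed to be the edges of Figure \ref{fig:graph-surgery-example}.

```python
from collections import defaultdict


def is_c_component(nodes, bidirected):
    """True if all nodes are linked by bidirected paths inside `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = defaultdict(set)
    for a, b in bidirected:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in adj[v] - seen:
            seen.add(u)
            stack.append(u)
    return seen == nodes


def c_forest_roots(nodes, directed, bidirected):
    """Return the root set if the induced subgraph is a C-forest, else None."""
    nodes = set(nodes)
    if not is_c_component(nodes, bidirected):
        return None
    children = {v: set() for v in nodes}
    for parent, child in directed:
        if parent in nodes and child in nodes:
            children[parent].add(child)
    # Every vertex may have at most one child inside the subgraph.
    if any(len(c) > 1 for c in children.values()):
        return None
    # The roots are exactly the vertices without children.
    return {v for v, c in children.items() if not c}


# Assumed toy graph (hypothetical edges, for illustration only).
DIRECTED = [("Q", "R"), ("R", "Z"), ("T", "Y"), ("W", "Z")]
BIDIRECTED = [("Q", "Z"), ("R", "Z"), ("Z", "Y"), ("T", "Y")]

tree_roots = c_forest_roots({"Q", "R", "Z"}, DIRECTED, BIDIRECTED)
forest_roots = c_forest_roots({"T", "Y", "Q", "R", "Z"}, DIRECTED, BIDIRECTED)
neither = c_forest_roots({"Z", "R", "W"}, DIRECTED, BIDIRECTED)
```

In this toy graph, a single returned root indicates a C-tree, several roots indicate a C-forest that is not a C-tree, and \texttt{None} indicates that the node set is not even a C-component.
\par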
% Additionally, $G_{Q, H, K, Y}$ is not a C-forest as $H$ does not belong to the same C-component of $\{Q, K, Y\}$.
We are now ready to use these concepts for understanding a particular graphical structure that is related to causal identifiability and is also used often in the causal discovery literature. 
% \red{KENNETH: should we cite some literature if we say so?}. 
\begin{definition}[Inducing paths for sets]
Let $\mathbf{X}, \mathbf{Y}$ be sets of variables in $G$. A path $p$ between $\mathbf{X}$ and $\mathbf{Y}$ is called an inducing path if every non-endpoint vertex is a collider on the path and an ancestor of either $\X$ or $\Y$.
% \begin{itemize}
%     \item Every non-endpoint vertex is a collider on the path and an ancestor of either $\X$ or $\Y$.
%     % The path forms a collider for every non-endpoint vertices
%     % Every non-endpoint vertex is a collider on the path and an ancestor of either endpoints. 
% \end{itemize}
\end{definition}
\begin{definition}[Hedge]
\label{def:hedge}
      Let $\X, \Y$ be sets of variables in $G$.  Let $F,F'$ be $\mathbf{R}$-rooted C-forests in $G$ such that $F \cap \X \ne \emptyset$, $F' \cap \X = \emptyset $ and $F' \subset F$, for some $\mathbf{R} \subset An(\mathbf{Y})_{G_{\overline{\X}}}$. Then $F$ and $F'$ form a hedge for $P(\mathbf{Y}|\DO(\mathbf{X}))$.
\end{definition}
For instance, in Figure~\ref{fig:graph-surgery-example}, if $\mathbf{R}=\{Z\}$, then $F=\{Q,R,Z\}$ and $F'=\{R,Z\}$ form a hedge for $P(Z|\DO(Q))$.
%
%%%%%%% SECTION %%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%
%
\section{Finding Graph Surgery Estimators in Polynomial Time} 
% In this section, we describe the details of our approach to finding graph surgery estimators. First, 
% we introduce the theoretical results of causal identifiability that lead to the development of the algorithm. Then, we discuss the workings of our proposed algorithm. We leave most of the proofs in the appendix section \ref{appex:proofs}.
% We extend the hedge condition to a generalized hedge condition for conditional queries.
In this section, we describe our approach to finding graph surgery estimators. First, 
we introduce the theoretical results on causal identifiability that lead to the development of the algorithm. Then, we discuss the workings of our proposed algorithm. We leave most of the proofs to Section A of the appendix.
% \ref{appex:proofs}.

We extend the hedge condition to a generalized hedge condition for conditional queries.
\begin{definition}[Generalized Hedge Condition]
\label{def:generalized-hedge}
    % \mathbf{(Generalized Hedge Condition)} 
    Let $\mathbf{X}, \mathbf{Y}, \mathbf{W}$ be sets of variables in $G$. Let $\mathbf{Z} \subseteq \mathbf{W}$ be the maximal set such that $P(\mathbf{Y}|\DO(\mathbf{X}), \mathbf{W}) = P(\mathbf{Y}|\DO(\mathbf{X},\mathbf{Z}), \mathbf{W} \setminus \mathbf{Z})$.  Let $F,F'$ be $\mathbf{R}$-rooted C-forests in $G$ such that $F \cap (\mathbf{X} \cup \mathbf{Z}) \ne \emptyset$, $F' \cap (\mathbf{X}\cup \mathbf{Z}) = \emptyset $, $F' \subset F$, and $\mathbf{R} \subset An(\mathbf{Y}\cup (\mathbf{W}\setminus \mathbf{Z}))_{G_{\overline{\mathbf{X}\cup \mathbf{Z}}}}$. Then $F$ and $F'$ are said to form a hedge for $P(\mathbf{Y}|\DO(\mathbf{X}), \mathbf{W})$.
\end{definition}
% We can use Figure \ref{fig:graph-surgery-example} for illustration of Definition \ref{def:generalized-hedge}. Let $\mathbf{X}=\{Q,T\}, \mathbf{W}=\{Z,W\}, \mathbf{Y}=\{Y\}$.
% By rule 2 of do-calculus \cite{pearl2009causality}, $P(Y| do(Q,T), Z, W) = P(Y|do(Q,T, W), Z)$. We can let $F=\{Q,R,Z,T,Y\}$ and $F'=F\setminus \{Q,T,R\}$ so that $F,F'$ are $\{Y,Z\}$-rooted C-forests to form a hedge for $P(Y|Z,W, do(Q,T)$.
% \begin{figure}[H]
%   \centering
% \includegraphics[width=0.7\linewidth]{Figures/algorithm-details/graph_surgery_upd.pdf}
%   \caption{Selection diagram: Causal Graph with $S$}
% \end{figure}
We can use Figure \ref{fig:graph-surgery-example} to illustrate Definition \ref{def:generalized-hedge}. Let $\mathbf{X}=\{Q,T\}, \mathbf{W}=\{R,W\}, \mathbf{Y}=\{Y,Z\}$.
By rule 2 of do-calculus \cite{pearl2009causality}, $P(Y,Z| \DO(Q,T),R, W) = P(Y,Z|\DO(Q,T, W),R)$. We can let $F=\{Q,R,Z,T,Y\}$ and $F'=F\setminus \{Q,T,W\}$ so that $F,F'$ are $\{Y,Z\}$-rooted C-forests to form a hedge for $P(Y,Z|\DO(Q,T),R,W)$.
The following theorem describes the relationship between a hedge and causal identifiability. 
% The importance of the hedge structure is that it provides
% graphical information about the identifiability of any causal queries as shown by Theorem \ref{thm:conditional-hedge-iff-non-id}.
%
% \begin{lemma} \label{lem:uid->hedge}
%     If there is no hedge for $P(\mathbf{Y}|\DO(\mathbf{X}))$, then $P(\mathbf{Y}|\DO(\mathbf{X}))$ is identifiable in $G$.
% \end{lemma}
% %
% %
% \begin{theorem} \label{thm:hedge-iff-uid} 
%     There exists a hedge for $P(\mathbf{Y}|\DO(\mathbf{X}))$ if and only if $P(\mathbf{Y}|\DO(\mathbf{X}))$ is unidentifiable in $G$
% \end{theorem}

% \begin{proof}
%     By Lemma \ref{lem:uid->hedge}  and Theorem 4 in \cite{shpitser2006identification}, the result follows.
% \end{proof}
%
\begin{theorem}\label{thm:conditional-hedge-iff-non-id}
     There exists a hedge for $P(\mathbf{Y}|\DO(\mathbf{X}),\mathbf{W})$ according to the generalized hedge condition if and only if $P(\mathbf{Y}|\DO(\mathbf{X}),\mathbf{W})$ is unidentifiable in $G$.
\end{theorem}
%
%
%
%
% ---------------
% When we are searching for invariant predictors, we have to ensure their corresponding causal queries are identifiable. To achieve this goal, we start characterizing the neighboring variables of the target and draw a connection with the hedge structure such that intervention on them will result in non-identifiability. We characterize them as follows.
\begin{definition}[Ancestral Confounded Set]
\label{def:MACS}
    % \mathbf{(Ancestral Confounded Set (ACS))} 
    Let $Y$ be a variable in $G$. A set $\mathbf{K}$ is an ancestral confounded set (ACS) for $Y$ if $\mathbf{K} = An(Y)_{G_{\mathbf{K}}} = C(Y)_{G_{\mathbf{K}}}$. We call an ACS $T_Y$ the maximum ACS (MACS) for $Y$ if $T_Y$ is the largest set such that $T_Y=An(Y)_{G_{T_Y}} = C(Y)_{G_{T_Y}}$.
\end{definition}
In Figure \ref{fig:graph-surgery-example}, for variable $Z$, $\mathbf{K} = An(Z)_{G_{\mathbf{K}}} = C(Z)_{G_{\mathbf{K}}}
= \{R, Z\}$ is an ACS for $Z$ while $\mathbf{K'}=\{Q,R,Z\}$ is the largest set satisfying the same ACS condition. Thus, $T_{Z} = \mathbf{K'}= \{Q,R,Z\}$ is the MACS for $Z$. 
One special property of the MACS is that it is unique for any variable in $G$, by Theorem 4 in \cite{shpitser2008dormant}.
% Throughout this work, we will denote the MACS of a set $\mathbf{K}$ in $G$ as $T_{\mathbf{K}}$. 
% % The significance of the MACS is that it can help determine whether a causal query is identifiable in a given causal graph.  - Not completely true.
% The significance of the MACS is that it can suggest specific causal queries that are non-identifiable in a given causal graph. 
% However, if we wish to learn about non-identifiable causal effects on a set of variables, we need to generalize MACS for a variable set. Thus, we need to define another graphical structure known as AC-component to find the MACS of the whole variable set.
Throughout this work, we denote the MACS of a set $\mathbf{K}$ in $G$ as $T_{\mathbf{K}}$. The significance of the MACS is that it helps determine whether a causal query is identifiable in a given causal graph. Next, we define another graphical structure, the AC-component, which is used for finding the MACS of a set.
\begin{definition}[AC-component]
    A set $\mathbf{Y}$ of nodes in $G$ is an ancestral confounded component (AC-component) if $\mathbf{Y}$ is a singleton, e.g., $\mathbf{Y} = \{Y\}$, or $\mathbf{Y}$ is a union of two distinct AC-components $\mathbf{Y}_{1}, \mathbf{Y}_{2}$ which have ancestral confounded sets $S_1, S_2$, respectively, and $S_1, S_2$ are connected by a bidirected arc.
\end{definition}
%
% \begin{figure}[H]
%   \centering
% \includegraphics[width=0.7\linewidth]{Figures/algorithm-details/graph_surgery_upd.pdf}
%   \caption{Selection diagram: Causal Graph with $S$}
% \end{figure}
For example, in Figure \ref{fig:graph-surgery-example}, $\{Y, Z\}$ is an AC-component because $\{Z\}$ is an ACS for $Z$, $\{Y\}$ is an ACS for $Y$, and $Y$ and $Z$ are connected by a bidirected arc. We can leverage the algorithm of \cite{shpitser2008dormant}, called \textbf{Find-MACS-on-set}
(see Algorithm 2 in Section B.2)
% \ref{alg:find-macs-on-set}) 
to find the MACS of a set in $G$. The following lemma describes the relationship between a MACS and an important graphical structure related to causal identifiability. 

\begin{lemma}
\label{lem:MAC=Y-rooted-Ctree} Let $\Y = \{Y\}$.
The output of \textbf{Find-MACS-on-set}$(G, \Y)$ is the MACS of $Y$. The MACS of $Y$ is a $Y$-rooted C-tree.
\end{lemma}
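
For a single target, Definition~\ref{def:MACS} suggests a simple fixed-point computation: starting from all nodes, alternately restrict to the ancestors of $Y$ and to the C-component of $Y$ in the current induced subgraph until nothing changes. The Python snippet below is a minimal sketch of this idea under our own assumptions; it is not claimed to be identical to \textbf{Find-MACS-on-set}, the function names are hypothetical, the toy graph is invented for illustration, and \texttt{parents\_of} returns only parents lying outside the given set.

```python
def _reachable(start, adj):
    """Nodes reachable from `start` by following `adj` (start included)."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in adj[v] - seen:
            seen.add(u)
            stack.append(u)
    return seen


def ancestors(y, nodes, directed):
    """An(y) in the subgraph induced by `nodes` (y included)."""
    parents = {v: set() for v in nodes}
    for p, c in directed:
        if p in nodes and c in nodes:
            parents[c].add(p)
    return _reachable(y, parents)


def c_component(y, nodes, bidirected):
    """C(y) in the subgraph induced by `nodes` (y included)."""
    nbrs = {v: set() for v in nodes}
    for a, b in bidirected:
        if a in nodes and b in nodes:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return _reachable(y, nbrs)


def find_macs(y, nodes, directed, bidirected):
    """Shrink the node set until K = An(y)_{G_K} = C(y)_{G_K}."""
    k = set(nodes)
    while True:
        reduced = ancestors(y, k, directed)
        reduced = c_component(y, reduced, bidirected)
        if reduced == k:
            return k
        k = reduced


def parents_of(node_set, directed):
    """Parents of `node_set` that lie outside of it (an assumed convention)."""
    return {p for p, c in directed if c in node_set and p not in node_set}


# Hypothetical toy graph: S -> D -> E -> Y with E <-> Y.
NODES = {"S", "D", "E", "Y"}
DIRECTED = [("S", "D"), ("D", "E"), ("E", "Y")]
BIDIRECTED = [("E", "Y")]

t_y = find_macs("Y", NODES, DIRECTED, BIDIRECTED)
fail = "S" in parents_of(t_y, DIRECTED)

# Adding D <-> E enlarges the MACS so that S becomes one of its parents.
t_y_extra = find_macs("Y", NODES, DIRECTED, BIDIRECTED + [("D", "E")])
fail_extra = "S" in parents_of(t_y_extra, DIRECTED)
```

In this toy graph the MACS of $Y$ is $\{E,Y\}$ with parent $D$; once the extra bidirected edge is added, the MACS grows to $\{D,E,Y\}$ and $S$ becomes a parent of it, which is the situation in which no graph surgery estimator exists.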
% \begin{lemma}
% The output of \mathbf{Find-MACS-on-set$(G,Y)$} is a $Y$-rooted C-tree.
% \end{lemma}


\subsection{Relationships with Graph Surgery Estimators}
We now explain how the previous section relates to finding a graph surgery estimator. Theorem \ref{thm:impossibility-of-invariant-predictor} and Theorem \ref{thm: parents-of-T_y-existence}  imply that knowing the MACS for a target variable can help identify some causal queries that will not be graph surgery estimators. 
If the selection variable $S$ has a child $W$ in a $Y$-rooted C-tree and there exists a hedge for $P(Y|\DO(W))$, then there is no graph surgery estimator in $G$.
%
% If a child $W$ of the selection variable $S$ is in a $Y$-rooted C-tree with which we can form a hedge for $P(Y|do(W))$, then there is no graph surgery estimator in $G$.
\begin{theorem}\label{thm:impossibility-of-invariant-predictor}
     If there exists a hedge for $P(Y|\DO(W))$ for some $W \in Ch(S)$, then for any $\mathbf{H}, \mathbf{J} \subseteq \mathbf{V}$,  we have $(Y\not \indep S |\mathbf{J})_{G_{\overline{\mathbf{H}}}}$ or $P(Y|\DO(\mathbf{H}), \mathbf{J})$ is unidentifiable in $G$.
\end{theorem}

\begin{theorem} \label{thm: parents-of-T_y-existence} If the selection variable $S$ is a parent of MACS $T_{Y}$, then there is no graph surgery estimator in $G$.
\end{theorem}



%
Furthermore, we can systematically leverage the MACS of the target variable to find graph surgery estimators. Theorem \ref{thm:valid-stable-estimator-v2} says that we can find some graph surgery estimators by taking the union of the MACS of the target and the MACSs of some children of the target. The intuition is that we can find some graph surgery estimators by intervening on the parents of the MACSs whenever the selection variable $S$ is not a parent of those MACSs. Although Theorem \ref{thm:valid-stable-estimator-v2} implies that we can find graph surgery estimators by using the MACSs of any subset of the children, we only use the largest subset, i.e., we pick $\mathbf{K} = \mathcal{H}$ (as denoted in Theorem \ref{thm:valid-stable-estimator-v2}), to incorporate as many predictors as possible in our algorithm.  

%
\begin{theorem} \label{thm:valid-stable-estimator-v2}
Let $T_Y$ be the MACS of $Y$ in $G$, $\mathcal{H}\coloneqq \{H: H\in Ch(Y),\ Pa(T_H)\not\ni S\}$, and $T_J\coloneqq \bigcup_{H\in \mathbf{K}} T_H$ for any $\mathbf{K}\subseteq \mathcal{H}$, where $T_{H}$ is the MACS with respect to the variable $H$. Let $\mathbf{D}= Pa(T_{Y}\cup T_{J})$.  If $S$ is not a parent of $T_Y$, then $P(Y|\DO(\mathbf{D}), \mathbf{K}, \mathbf{W})$ is identifiable in $G$ and $(Y\indep S |\mathbf{W},\mathbf{K})_{G_{\overline{\mathbf{D}}}}$ for any $\mathbf{W} \subseteq (T_{Y}\cup T_J) \setminus (Y\cup \mathbf{K})$.
\end{theorem}
%
%
%
\begin{corollary}
\label{cor:valid-stable-estimator-v1}
Let $T_Y$ be the MACS of $Y$ in $G$ and $\mathbf{D}= Pa(T_{Y})\setminus T_{Y}$. If $S$ is not a parent of $T_Y$, then
$P(Y|\DO(\mathbf{D}),\mathbf{W})$ is identifiable in $G$ and $(Y \indep S | \mathbf{W})_{G_{\overline{\mathbf{D}}}}$ for any $\mathbf{W} \subseteq  T_{Y} \setminus Y$.
\end{corollary}
%
%
In addition, we search over the bidirected neighbors of $Y$ that are neither in the MACS of the target nor in the MACS of any child of the target. There are two reasons for doing so. First, finding the MACSs of these bidirected neighbors increases the number of graph surgery estimators output by our proposed algorithm. Second, we exclude the bidirected neighbors that already lie in the MACS of the target or of one of its children, because searching them again would be inefficient due to duplicate searches for the same query.

\begin{theorem}
\label{thm:find-macs-on-bidirected-nbr-Y-and-Y} Let $T_Y$ be the MACS of $Y$ in $G$, $T_H$ be the MACS of any child $H$ of $Y$ in $G$. Define
\begin{align}
    T_{\mathbf{C}}&\coloneqq \bigcup_{H\in Ch(Y)} T_{H}, \\
    \mathcal{Z} &\coloneqq \{Z: Z \in (C(Y) \cap Nbr(Y) ) \setminus (T_{Y}\cup T_{\mathbf{C}}) \notag\\
    &\qquad\quad \text{s.t. } Pa(T_{Y \cup Z})\not\ni S \}, \\
    T_B&\coloneqq \bigcup_{Z\in \mathbf{M}} T_{Y\cup Z}
\end{align}  for any $\mathbf{M}\subseteq \mathcal{Z}$, where $T_{Y \cup Z}$ is the MACS for the set $(Y\cup Z)$. Let $\mathbf{D}= Pa(T_{B})$.  If $S$ is not a parent of $T_Y$, then $P(Y|\DO(\mathbf{D}), \mathbf{M}, \mathbf{W})$ is identifiable in $G$ and $(Y\indep S |\mathbf{W},\mathbf{M})_{G_{\overline{\mathbf{D}}}}$ for any $\mathbf{W} \subseteq T_{B}\setminus (Y\cup \mathbf{M})$.
\end{theorem}
%

%

%
%
\subsection{Algorithm Details}
%
%
%
%
%
 Our approach begins with Algorithm~\ref{alg:id4ip}: \textbf{ID4IP}. It takes the selection variable $S$, the target variable $Y$, and the causal graph $G$ as input, and outputs an invariant predictor if one exists.
In the beginning, we initialize two sets $P_{set}$ and $L_{set}$ for storing the invariant predictors and their corresponding losses.
At line~\ref{lineNum:findMACs-TY}, we call Algorithm 2 in Section B.2:
% ~\ref{alg:find-macs-on-set}
\textbf{Find-MACS-on-set} as a sub-routine, which finds the MACS $T_Y$ (Definition~\ref{def:MACS}) of the target variable $Y$. If $S$ is a parent of $T_Y$, the algorithm returns \textbf{FAIL} at lines~\ref{lineNum:failure1a}--\ref{lineNum:failure1b}.
This indicates that there exists no graph surgery estimator in $G$.
\begin{algorithm}[t!]
     \caption{Greedy-Eval($\Y, \X, \mathbf{W}$)}
     \label{alg:greedy-eval}
     \begin{algorithmic}[1]
      \STATE {\textbf{Input:}  A set of targets $\Y$, an intervention set $\X$, a conditioning set $\mathbf{W}$} 
        \STATE { \textbf{Output:} $P$, a causal query that corresponds to the lowest training loss $L$ among the searched queries.}
             \STATE $A_{0} = \mathbf{Y}$ \COMMENT{A cumulative array s.t., $A_i \subset A_{i+1}$}
                 \FOR{$i \in 0 \ldots (|\mathbf{W}| -1)$} \label{line:loop-over-w}
                \STATE $K = \argmin \limits_{J\in \mathbf{W} \setminus A_{i}} \mathbf{computeLoss}(A_{i} \cup \{J\}, \X ) - \mathbf{computeLoss}(A_{i}, \X)$
                \label{line:argmin-K}
                \STATE $A_{i+1} = A_{i} \cup \{K\}$ \label{line:AunionK}
            \ENDFOR
             \STATE \textbf{Return} $P(A_{|\mathbf{W}|}|do(\X))$ \label{line:alg1-return}
             \COMMENT{Value of the last index.} 
     \end{algorithmic}
\end{algorithm}
\begin{algorithm}[t!]
  \caption{addBestIP($S, Y, T_{Y},
  P_{{set}},L_{{set}},
  \mathbf{Z},  \mathbf{R}$)}\label{alg:addBestIP}
  \begin{algorithmic}[1]
  \STATE {{\bfseries Input:} Selection variable $S$, Target $Y$, C-tree $T_Y$, Predictors and Loss terms $P_{{set}}, L_{{set}}$, Variable set $\mathbf{Z}$,  Additional Roots $\mathbf{R}$.}
  \STATE {{\bfseries Output:} A set of predictors $P_{set}$, a set of losses $L_{set}$, a set of MACS $T_{visited}$}
  \STATE $T_{J} = T_Y$; $\mathcal{H} =\{Y\}$; $T_{visited} = \emptyset$
  \IF{$\mathbf{Z} \ne \emptyset$} \label{line:Z-empty}
    % \IF{$S \notin An(Y)$}
    % \STATE \mathbf{Return FAIL}
    % \ENDIF
    \FOR{$H \in  \mathbf{Z}$} \label{line:H-in-Z}
         \STATE $T_{H} =$ \textbf{Find-MACS-on-set} $(G, \mathbf{R} \cup \{H\})$ \label{line:T_H}
         % \IF{$Q$ is \textbf{True}}
        \STATE $T_{visited} = T_{visited} \cup T_{H}$
         % \ENDIF
        \IF{$S \not \in Pa(T_{H})$} \label{line:s_not_in}
         \STATE $T_{J} = T_{J} \cup T_{H}$ \label{line:Tj-union}
         \STATE $\mathcal{H} = \mathcal{H} \cup \{H\}$ \label{line:h-union}
        \ENDIF
    \ENDFOR
      \ENDIF
      \STATE $P, L = $ \textbf{Greedy-Eval}$(\mathcal{H}$, $ Pa(T_{J}), T_{J}\setminus \mathcal{H}$) \label{line:greedy-eval}
    \IF{$P \notin P_{{set}}$}
      \STATE $P_{{set}}.append(P); L_{{set}}.append(L)$
     \ENDIF
     \STATE {\bfseries Return: $P_{{set}}, L_{{set}}, T_{visited}$}
 \end{algorithmic}
\end{algorithm}
\par At lines~\ref{lineNum:search T_y},~\ref{lineNum:findchild} and~\ref{lineNum:findbidirectednbr}, we call the sub-routine Algorithm~\ref{alg:addBestIP}: \textbf{addBestIP}.
% and empty sets as parameters.
This sub-routine takes the selection variable $S$, the target $Y$, the C-tree $T_Y$, $P_{set}, L_{set}$, a variable set $\mathbf{Z}$, and roots $\mathbf{R}$ as inputs. The first five parameters stay the same for all these sub-routine calls, while $\mathbf{Z}$ and $\mathbf{R}$ change. Conditioning on the variable set $\mathbf{Z}$ allows us to find more invariant predictors. $\mathbf{R}$ is either empty or contains the target $Y$.
% and we find c-forests rooted at $R$ and some other variables. 
\par
Inside the \textbf{addBestIP} sub-routine, we initialize $T_J$ with the $Y$-rooted C-tree $T_Y$, $\mathcal{H}$ with the target variable, and $T_{visited}$ as an empty set that tracks all visited C-tree nodes so that the algorithm does not have to consider those nodes again.
If $\mathbf{Z}$ is non-empty, for each variable $H\in \mathbf{Z}$, we call the sub-routine \textbf{Find-MACS-on-set} for $\mathbf{R}\cup \{H\}$ (line~\ref{line:T_H}). This sub-routine returns a MACS $T_H$ that contains a C-forest rooted at $\mathbf{R}\cup\{H\}$. We store $T_H$ in $T_{visited}$. However, we only use those C-forests for which $S\notin Pa(T_H)$ holds (line~\ref{line:s_not_in}): for each of them, we merge $T_H$ into $T_J$ and add $H$ to $\mathcal{H}$ (lines~\ref{line:Tj-union}--\ref{line:h-union}). 
After the loop ends, we call the sub-routine Algorithm~\ref{alg:greedy-eval}: \textbf{Greedy-Eval} at line~\ref{line:greedy-eval} and send the three sets $\mathcal{H},  Pa(T_J)$ and $T_J\setminus \mathcal{H}$ as inputs.
This sub-routine returns the causal query with the lowest validation loss among the queries it searches.
If this is a new query that we have not found yet, we add it and its corresponding loss to $P_{set}, L_{set}$ and return them along with $T_{visited}$ from the sub-routine.
\par
Now, back to Algorithm~\ref{alg:id4ip}: all three calls at lines~\ref{lineNum:search T_y},~\ref{lineNum:findchild} and~\ref{lineNum:findbidirectednbr} share the same first five parameters.
% $Y$, C-tree $T_Y$.
The function call at line~\ref{lineNum:search T_y} finds the invariant predictors that use ancestors of $Y$ as inputs.
% \red{a predictor is not a node to be ancestor of something. please fix}
 For this case, lines~\ref{line:Z-empty}--\ref{line:h-union} of the \textbf{addBestIP} sub-routine are skipped. 
To better understand our algorithm up to this step, consider Figure~\ref{fig:alg-example}, where we have the MACS $T_Y=\{E,Y\}$. Thus the step at line~\ref{lineNum:search T_y} will select the lowest-loss causal query of the form $P(Y|\DO(D),{H}), {H}\subseteq \{E\}$.
However, if there existed a bidirected edge between $D$ and $E$, then $T_Y$ would be $\{A,D,E,Y\}$, and $S$ would be a parent of this set. Thus we would have to return \textbf{FAIL} (line~\ref{lineNum:failure1a}). The reason is that we cannot condition on any variable in $T_Y$, since there would exist inducing paths from $S$ to $Y$, and we cannot intervene on any variable in $T_Y$ either, since the resulting query would be non-identifiable. This illustrates a situation in which there exists no invariant predictor.
\begin{algorithm}[t!]
  \caption{ID4IP($S, Y, G$)}\label{alg:id4ip}
  \begin{algorithmic}[1]
    \STATE {\textbf{Input:}  Selection variable $S$, target $Y$, causal graph $G=(\mathbf{V},\mathbf{E})$}
    % with the lowest validation loss among all found estimators
    % signifying ID4IP cannot find any estimator
    \STATE {\textbf{Output:} An invariant predictor $P^\star$ or \textbf{FAIL} to indicate that there is no graph surgery estimator in $G$.}
     \STATE $P_{{set}} = \emptyset; L_{{set}} = \emptyset$
     \STATE $T_{Y}$ = \textbf{Find-MACS-on-set}$(G, Y)$ \label{lineNum:findMACs-TY}
     \IF{$S \in Pa(T_{Y})$} \label{lineNum:failure1a}
            \STATE \textbf{Return
            \textbf{FAIL}} \label{lineNum:failure1b}
%
    \ENDIF
    %  \STATE $P, L = $\textbf{Greedy-Eval}$(Y, Pa(T_{Y}), T_{Y}\setminus Y)$
    %  \STATE $P_{\mathbf{set}}.append(P); L_{\mathbf{set}}.append(L)$
    \STATE $P_{{set}},L_{{set}}, T_{\emptyset}=$ 
     \textbf{addBestIP}($S, Y, T_{Y}, P_{{set}},L_{{set}},  \emptyset,\emptyset$)\label{lineNum:search T_y}
     \STATE $P_{{set}},L_{{set}}, T_{\mathbf{Ch}}=$
     \textbf{addBestIP}($S, Y, T_{Y}, P_{{set}},L_{{set}}, Ch(Y), \emptyset$) \label{lineNum:findchild}
     \STATE $\mathbf{T} = T_{Y} \cup T_{\mathbf{Ch}}$ \label{lineNum:TycupTc}
     \STATE $P_{{set}},L_{{set}},T_C=$
     \textbf{addBestIP}($S, Y, T_{Y}, P_{{set}},L_{{set}}, (C(Y) \cap Nbr(Y)) \setminus \mathbf{T},  \{Y\}$)  \label{lineNum:findbidirectednbr}
     %-----------
     \IF{$P_{{set}} \ne \emptyset$}
     \STATE \textbf{Return }the predictor $P$ in $P_{{set}}$ with the lowest corresponding loss $L$ in $L_{{set}}$
    %  \ELSE
     \ENDIF
      \STATE \textbf{Return FAIL}\label{lineNum:failure} 
 \end{algorithmic}
\end{algorithm}
%
For the function call at line~\ref{lineNum:findchild}, we send $\mathbf{Z}= Ch(Y)$ as a parameter. 
In Algorithm~\ref{alg:addBestIP}, lines~\ref{line:H-in-Z}--\ref{line:h-union}, we iterate over $\mathbf{Z}$ and find the C-tree rooted at each of the children, since $\mathbf{R}=\emptyset$. We store these C-trees in $T_{visited}$. However, we cannot utilize a child for invariant predictors if $S$ is a parent of its C-tree. This way we can search for invariant predictors that employ ancestors of the chosen children of $Y$ for prediction, since $Y$ becomes dependent on those ancestors after conditioning on the chosen children. 
In our example in Figure~\ref{fig:alg-example}, $Ch(Y)=\{C,Z\}$ and $T_C=\{A,C\}, T_Z=\{E,Y,Z\}$. We cannot utilize child $C$ for invariant predictors since $S\in Pa(T_C)$. However, $Pa(T_Z)= \{D,W\}$ and we have $\mathcal{H}=\{Y,Z\}$. \textbf{Greedy-Eval} will choose one query from all queries of the form $P(Y| \DO(D,W), Z, {H}), {H}\subseteq \{E\}$.
% and $T_J\setminus \mathcal{H} =[E]$.
\par
At line~\ref{lineNum:TycupTc} of \textbf{ID4IP}, we store in $\mathbf{T}$ the $Y$-rooted C-tree $T_Y$ together with the C-trees $T_{\mathbf{Ch}}$ rooted at the children, as returned by the call at the previous step. 
Finally, at line~\ref{lineNum:findbidirectednbr}, we only consider the bi-directed neighbors $\mathbf{N}$ of $Y$ that are not in $\mathbf{T}$, i.e., $\mathbf{N}=(C(Y) \cap Nbr(Y)) \setminus \mathbf{T}$.
The goal is to avoid recomputing overlapping queries among the MACSs found during these three steps.
% that are neither parents nor children.
We send this set as a parameter of the \textbf{addBestIP} sub-routine so that we can find invariant predictors that utilize ancestors of the bi-directed neighbors in this set. Similar to children, we iterate over these neighbors to find more invariant predictors that can predict $Y$ after conditioning on that bi-directed neighbor. 
For this purpose, we find C-forests rooted at both $Y$ and some neighbor in $\mathbf{N}$ such that $S$ is not a parent of the found C-forest. Thus, we send the root set $\mathbf{R}=\{Y\}$ as a parameter of the sub-routine, unlike the previous case, where we sent $\mathbf{R}=\emptyset$. For Figure~\ref{fig:alg-example}, $V$ is the bi-directed neighbor of $Y$. Here $T_V=\{Y,V,W\}$ and $Pa(T_V)= \{C,Z\}$. Therefore, the last \textbf{addBestIP} call will return the query with the lowest validation loss from a set of causal queries of the form $P(Y| \DO(C,Z), V,{H}),{H}\subseteq \{W\}$.
% However, unlike the previous case, this time we need to find c-forests rooted at both $Y$ and some variables in $\mathbf{N}$ since .
\par
After these three function calls, we return the predictor from $P_{set}$ with the minimum validation loss. If our algorithm cannot find any predictor even after these three steps, i.e., $P_{set}$ is empty, we return \textbf{FAIL}, indicating that there exists no invariant predictor for this graph. This follows from Theorem~\ref{thm:gurantees-of-finding-invariant-predictor}.
    
\begin{figure}[t!]
  \centering
  \includegraphics[width=0.6\linewidth]{Figures/algorithm-details/alg-example.pdf}
  \caption{A causal graph inducing invariant predictors of the form $P(Y| do(D),{H})$, ${H}\subseteq \{E\}$ when we consider the C-tree $T_Y$. Similarly, we condition on $Ch(Y)=\{Z\}$ and utilize $T_Z$ for predictors of the form $P(Y| do(D,W), Z, {H})$, ${H}\subseteq \{E\}$. Finally, we condition on bi-directed neighbors of $Y$ and utilize $T_{V,Y}$ for predictors of the form $P(Y| do(C,Z),V,{H})$, ${H}\subseteq \{W\}$.
  }\label{fig:alg-example}
\end{figure}
%
%
%
\par
During the execution of our algorithm, we call Algorithm~\ref{alg:greedy-eval}: \textbf{Greedy-Eval} several times
% We employ Algorithm~\ref{alg:greedy-eval}: \textbf{Greedy-Eval} that we call from the \textbf{addBestIP} sub-routine 
to find the query with the lowest validation loss among all found surgery estimators. The \textbf{Greedy-Eval} sub-routine takes $\mathcal{H},  Pa(T_J)$ and $T_J\setminus \mathcal{H}$ as arguments, i.e., the set of targets $\Y$, the set of interventions $\X$, and the set of conditioning variables $\mathbf{W}$, respectively.
Here we initialize an array of sets $A$ with $\Y$ as the first element. The algorithm then loops $|\mathbf{W}|$ times, and in iteration $i$ it sets the $(i+1)$-th element of $A$ to $A_i$ together with a new variable $K$ (lines~\ref{line:loop-over-w}-\ref{line:AunionK}).
The variable $K$ is chosen from the conditioning set as the variable that, combined with $A_i$, reduces the training loss the most (line~\ref{line:argmin-K}). After iteration $i$, $A_{i+1}$ holds the joint variable list of the causal query that we might return as the result.
Finally at line~\ref{line:alg1-return}, we return the query $P(A_{|\mathbf{W}|}|do(\X))$ where $A_{|\mathbf{W}|}$ indicates the joint variable list of the array after the loop ends. This query is received in Algorithm~\ref{alg:id4ip}, and finally output as an invariant predictor with minimum validation loss among the queries that can be produced from the inputs of \textbf{Greedy-Eval}.
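
The greedy loop of Algorithm~\ref{alg:greedy-eval} can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not our released implementation: \texttt{greedy\_eval}, \texttt{toy\_loss}, and the table \texttt{GAIN} are hypothetical names, and the loss values are made up. Since $\mathbf{computeLoss}(A_i, \X)$ does not depend on the candidate $J$, minimizing the loss difference is the same as minimizing $\mathbf{computeLoss}(A_i \cup \{J\}, \X)$. The sketch returns the whole chain $A_0 \subset A_1 \subset \dots$ together with the index of its lowest-loss element, so that both the last element and the lowest-loss element can be read off.

```python
def greedy_eval(targets, interventions, conditioning, compute_loss):
    """Greedy forward selection over the conditioning set.

    Returns the chain A_0, A_1, ..., the loss of each A_i, and the index
    of the chain element with the lowest loss.
    """
    chain = [frozenset(targets)]
    losses = [compute_loss(chain[0], interventions)]
    remaining = set(conditioning)
    while remaining:
        current = chain[-1]
        # sorted() only makes tie-breaking deterministic.
        best = min(
            sorted(remaining),
            key=lambda j: compute_loss(current | {j}, interventions),
        )
        chain.append(current | {best})
        losses.append(compute_loss(chain[-1], interventions))
        remaining.remove(best)
    best_index = min(range(len(chain)), key=losses.__getitem__)
    return chain, losses, best_index


# Hypothetical loss table: each variable lowers (or raises) the loss.
GAIN = {"E": 30, "W": 10, "N": -5}


def toy_loss(variables, interventions):
    """Made-up integer loss; `interventions` is unused in this toy."""
    return 100 - sum(GAIN.get(v, 0) for v in variables)


chain, losses, best_index = greedy_eval({"Y"}, {"D"}, {"E", "W", "N"}, toy_loss)
```

In this toy run the variables are added in the order $E$, $W$, $N$, and the lowest loss is attained before the last variable is added, which is why the sketch records the loss of every element of the chain.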
%%%% Section: complexity analysis %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%
\par
Next, we establish two guarantees for \textbf{ID4IP}: if there exists a graph surgery estimator, then \textbf{ID4IP} outputs one, and any estimator that \textbf{ID4IP} returns is a valid graph surgery estimator.

\begin{theorem} 
\label{thm:gurantees-of-finding-invariant-predictor}
If there exists a graph surgery estimator in $G$, \textbf{ID4IP}  %(Algorithm \ref{alg:id4ip}) 
outputs at least one graph surgery estimator.
\end{theorem}


\begin{theorem} \label{thm:soundness}
    \textbf{(Soundness of Algorithm~\ref{alg:id4ip}: ID4IP)} When \textbf{Algorithm~\ref{alg:id4ip}: ID4IP} %(Algorithm \ref{alg:id4ip}) 
    returns an estimator, it is a graph surgery estimator with respect to the given target and the selection variable in $G$. 
\end{theorem}
%
 \section{Complexity Analysis}
In this section, we compare the time complexity of ID4IP with that of the Graph Surgery Estimator (GSE) algorithm (see Algorithm 5 in Section B.4)
% \ref{alg:graph-surgery-estimator})
\cite{subbaswamy2019preventing}. GSE uses various subroutines for converting conditional queries to unconditional queries by checking d-separations.  \cite{shachter2013bayes} provides an efficient algorithm for checking d-separation, which we incorporate into our analysis of the time complexity of the GSE algorithm.



\begin{theorem}
\textbf{(GSE Complexity)} \footnote{The algorithm presented in \cite{subbaswamy2019preventing} does not explicitly search over supersets but the proofs of both soundness and completeness of the algorithm indicate that it should search over supersets, which is why we are taking this version as a baseline.} \label{thm:graphsurgery-complexity} Let $|Ch(S)| = C$, $\mathbf{M}= Ch(S)$, $\mathbf{Q} = \mathbf{V}\setminus (\mathbf{M}\cup Y)$. 
Given a causal graph $G=(\mathbf{V},\mathbf{E})$ and disjoint variables $\mathbf{X,Y} \subset \mathbf{V}$, the time complexity of the Graph Surgery Estimator (GSE) algorithm (Algorithm 5 in Section B.4)
% \ref{alg:graph-surgery-estimator}) 
is $O(2^{2(|\mathbf{V}|-C)- 1} \times B)$, where $B$ represents the time complexity of the \textbf{ID} algorithm. 
\end{theorem}
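
To give a sense of scale for the bound in Theorem~\ref{thm:graphsurgery-complexity}, consider the purely illustrative values $|\mathbf{V}|=16$ and $C=2$, which are assumptions made for this calculation and are not taken from a specific graph in our experiments. Then
\begin{equation*}
    2^{2(|\mathbf{V}|-C)-1} = 2^{2(16-2)-1} = 2^{27} = 134{,}217{,}728,
\end{equation*}
so the bound allows on the order of $10^{8}$ calls to \textbf{ID}. Keeping $C=2$ and increasing $|\mathbf{V}|$ to $32$ gives $2^{2(32-2)-1} = 2^{59} \approx 5.8\times 10^{17}$.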

One of the major benefits of using the MACS lies in its efficiency.  Combined with greedy search, \textbf{ID4IP} has polynomial time complexity relative to the complexity of the \textbf{ID} algorithm, whereas GSE has exponential time complexity relative to it.
\begin{theorem} \label{thm:complexity-findMACS}\cite{shpitser2008dormant}
\textbf{Find-MACS-on-set}$(G, \mathbf{Y})$ outputs the MACS of $\mathbf{Y}$ in time polynomial in the size of the graph. 
\end{theorem}

\begin{theorem}\label{thm:id4ip-complexity}
    \textbf{(ID4IP  Complexity)} Given a causal graph $G=(\mathbf{V},\mathbf{E})$ and disjoint variables $\mathbf{X,Y} \subseteq \mathbf{V}$, the complexity of \textbf{ID4IP} (Algorithm \ref{alg:id4ip}) is $O((|(C(Y) \cap Nbr(Y))\setminus (T_{Y}\cup T_{\mathbf{C}})| +|Ch(Y)|+1)K + (|T_{Y}| - 1 + |T_{J} | + |T_{J}^{'} | - |\mathcal{H}^{'}| - |\mathcal{H}|) B)$, where $K$ represents the time complexity of \textbf{Find-MACS-on-set}, $B$ represents the time complexity of the \textbf{ID} algorithm, $T_Y$ is the MACS of $Y$ in $G$, $T_H$ is the MACS of a child $H$ of $Y$ in $G$, and $T_{\mathbf{C}}\coloneqq \bigcup_{H\in Ch(Y)} T_{H}$.
\end{theorem}
%
%%%%%%%%%%%%%% Experiment %%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%
\section{Experiment}
In this section, we compare the Graph Surgery Estimator (Algorithm 5 in Section B.4)
% \ref{alg:graph-surgery-estimator}) 
and the \textbf{ID4IP} algorithm (Algorithm \ref{alg:id4ip}) in terms of accuracy and run time on both synthetic and semi-synthetic data sets.  Throughout the experiments, we use a server with $128$ CPU cores and $126$ GB of memory.  All experiments are implemented in Python. The source code is available at: \href{https://github.com/kenneth-lee-ch/id4ip}{https://github.com/kenneth-lee-ch/id4ip}
%
\begin{figure*}[t!]
\centering
 \subfigure[$16$ observed variables]{\label{fig:4latentrandomDAG}\includegraphics[width=0.32\textwidth]{Figures/n_16test_loss_by_time_comparison.png}}
 \subfigure[$25$ observed variables]{\label{fig:O(logn)}\includegraphics[width=0.32\textwidth]{Figures/n_25test_loss_by_time_comparison.png}}
 \subfigure[$32$ observed variables]{\label{fig:O(n)}\includegraphics[width=0.32\textwidth]{Figures/n_32test_loss_by_time_comparison.png}}

\caption{Comparison between \textbf{ID4IP} (cyan dashed lines) and \textbf{GSE} (purple solid lines) in terms of finding graph surgery estimators within a given time limit of $600$ seconds. The horizontal axis represents the execution time allowed for finding the predictors. The vertical axis represents the zero-one test loss. Each point represents a graph surgery estimator found by the models. \textbf{Green}: the test loss of the Bayes optimal predictor $P(Y|Pa(Y))$, where $Pa(Y)$ includes the latent confounders, treated as observed. \textbf{Red}: the worst predictive performance, given by a dummy classifier. }\label{fig:synthetic-exp-part1}
\end{figure*}
\begin{figure*}[t!]
\centering
\subfigure[$16$ observed variables]{\label{fig:n26_training_samplesize}\includegraphics[width=0.32\textwidth]{Figures/_n_16test_loss_by_time_comparison.png}}
\subfigure[$25$ observed variables]{\label{fig:n25_training_samplesize}\includegraphics[width=0.32\textwidth]{Figures/_n_25test_loss_by_time_comparison.png}}
\subfigure[$32$ observed variables]{\label{fig:n32_training_samplesize}\includegraphics[width=0.32\textwidth]{Figures/_n_32test_loss_by_time_comparison.png}}

%
%%%%%%%---Temporarily commented out start
 % \subfigure[$16$ observed variables]{\label{fig:testloss-trainsample-n16}\includegraphics[width=0.32\textwidth]{Figures/g25.png}}
 %
 % \subfigure[$32$ observed variables]{\label{fig:testloss-trainsample-n32}\includegraphics[width=0.32\textwidth]{Figures/g32.png}}
%%%%%%%---Temporarily commented out ends
\caption{
Comparison between \textbf{ID4IP} (cyan dashed lines) and \textbf{GSE} (purple solid lines) in terms of sensitivity to training sample size. The horizontal axis represents the number of training samples used to find the graph surgery estimators. The vertical axis represents the zero-one loss averaged over three graph surgery estimators found from three randomly generated training samples. The time limit is set to $60$ seconds. Graphs in which no graph surgery estimator exists are excluded. 
}
%
% with different numbers of latent confounders within program execution time of $1, 60, 120, 180, 240, 300$ seconds.
% Blue, green, yellow colors represent $n= 4, 8, 16$ respectively
%
\label{fig:synthetic-testloss-trainsample}
\end{figure*}
%
%
% 
%
\subsection{Synthetic Experiment}



In the synthetic experiment, we evaluate how well \textbf{ID4IP} and \textbf{GSE} find graph surgery estimators within a given time limit for program execution, and how sensitive they are to the training sample size. 

 \subsubsection{Finding graph surgery estimators in a given time limit}
 We randomly generate DAGs of size $n$ using the PyAgrum library in Python \cite{hal-03135721}, where $n \in \{16, 25, 32\}$, with $n / 2$ latent confounders. The number of directed edges is set to $3n$. Each variable follows a Bernoulli distribution with a randomly generated probability between $0$ and $1$. We assign the selection variable to be an ancestor of the target variable. The selection variable is generated by changing the marginal distribution of the assigned variable before generating the test set while fixing the same target variable. To demonstrate the efficiency of finding graph surgery estimators, we generate $10000$ training samples, and both \textbf{GSE} and \textbf{ID4IP} directly use the population distribution of the training data to select the graph surgery estimator with the lowest zero-one loss between its predicted labels and the actual labels in the training data. 
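
 For reference, the zero-one loss over $m$ samples with true labels $y_i$ and predicted labels $\hat{y}_i$ (notation introduced here only for illustration) can be written as
 \[
 L_{0\text{-}1} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{\hat{y}_i \neq y_i\},
 \]
 that is, the fraction of misclassified samples.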

 Figure \ref{fig:synthetic-exp-part1} shows that \textbf{ID4IP} finds graph surgery estimators more efficiently than \textbf{GSE}, particularly as the number of observed variables increases from $16$ to $32$. Additionally, the graph surgery estimators found by \textbf{ID4IP} often result in lower test loss, since \textbf{GSE} cannot find the best graph surgery estimator within the time limit. In Figure \ref{fig:O(n)}, \textbf{GSE} fails to find any graph surgery estimator in the first $100$ seconds. This can happen because \textbf{GSE} needs to check whether a causal query is unidentifiable by calling the ID algorithm repeatedly during the search, whereas each causal query found by \textbf{ID4IP} is identifiable, so \textbf{ID4IP} calls the ID algorithm only to derive the estimands of the graph surgery estimators it has found. 

\subsubsection{Sensitivity to training sample size}

In addition, we evaluate the predictive performance of both \textbf{GSE} and \textbf{ID4IP} while varying the number of training samples. As in the previous experiment, we randomly generate DAGs of size $n$, where $n \in \{16, 25, 32\}$, with $n / 2$ latent confounders. The number of directed edges is set to $3n$. Each variable follows a Bernoulli distribution with a randomly generated probability between $0$ and $1$. We evaluate the sensitivity over a range of training sample sizes, e.g., $50, 100, 250, 500, 1000, 2500, 5000, 10000$. For each training sample size, we randomly generate three different training samples and report the average zero-one test loss and its standard error over the three (potentially different) graph surgery estimators found by each model, while fixing the same test sample of size $10000$ for all training sample sizes. We fix the time limit to $60$ seconds for both algorithms to find graph surgery estimators. We also use $30 \%$ of the training data for validation. We set the test loss to $0.5$ if a model fails to find a graph surgery estimator within the given time limit. Furthermore, we adopt a heuristic for learning the training distribution from the observed data. We turn every bidirected edge in the given causal graph into a directed edge while maintaining acyclicity. If the children of the latent confounder already have a directed edge between them, we simply remove that bidirected edge.  We use the \textit{BNLearner} function in PyAgrum, which uses a greedy hill climbing algorithm by default, to learn the conditional probability tables from the observed data with a smoothing prior. We use this approximated training distribution to evaluate any graph surgery estimator found by both models.
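
For concreteness, a standard way to compute an average and a standard error over three runs is sketched below. The notation $L_r$, $\bar{L}$, and $\mathrm{SE}$ is introduced only for this illustration, and the use of the sample standard deviation with Bessel's correction is an assumption rather than a detail stated above. Writing $L_r$ for the zero-one test loss obtained from the $r$-th training sample,
\[
\bar{L} = \frac{1}{3}\sum_{r=1}^{3} L_r,
\qquad
\mathrm{SE} = \sqrt{\frac{1}{3 \cdot 2}\sum_{r=1}^{3}\left(L_r - \bar{L}\right)^{2}}.
\]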

From Figure \ref{fig:synthetic-testloss-trainsample}, we see that as the training sample size increases, both algorithms generally achieve lower test loss. We also see that the difference in test loss between \textbf{GSE} and \textbf{ID4IP} increases as the number of nodes increases. This is expected, since the time limit of $60$ seconds prevents \textbf{GSE} from finding the best graph surgery estimator. 


% \par
% \subsubsection{Varying training sample size}
% We illustrate the change in test loss of both ID4IP and GSE algorithms with varying numbers of training samples. We consider causal graphs with $n \in \{8, 16, 32\}$ number of nodes and $\lfloor n/2 \rfloor $ numbers of latent confounders. For each causal graph,
% we train the algorithms with $\{50, 100, 500, 1000, 10000\}$ number of samples randomly generated from a SCM. 
% For each specific training parameter, we fit both algorithms on three random training sample sets and evaluate their performance in terms of $0-1$ test loss with 10k test samples.
% We let each algorithm execute for a maximum of $600s$ to find the predictors. 


%  The predictive performance of \textbf{ID4IP} (blue-dashed line) and \textbf{GSE} (red-solid line) is visualized in Figure~\ref{fig:synthetic-testloss-trainsample}. We also show $P(Y|Pa(Y))$ and the worst case $(i.e.,0.5)$. In each setting i.e. $n=\{8,16,32\}$ shown in Figure~\ref{fig:testloss-trainsample-n8}-~\ref{fig:testloss-trainsample-n32}, we observe that the test loss of \textbf{ID4IP} and \textbf{GSE} is high for low training sample size and decreases when we have more samples in the training dataset.
%  However, the test loss is higher for large graphs compared to smaller graphs, since it is more challenging to fit the training dataset and achieve better predictions.
% Note that the slope of each line is not steep. This implies that ID4IP is not highly dependent on the training sample size.
% Finally, even though we observe comparable performance for both algorithms, the execution time for \textbf{GSE} is around $400\%-800\%$ higher than \textbf{ID4IP} which illustrates the superiority of our algorithm. 



\subsection{Semi-synthetic Data}
%
%

\begin{table}[t]
\begin{tabular}{c|c|c}
    \toprule
    Algorithm & Sachs (11 nodes) & Alarm (37 nodes)\\
    \midrule
    \textbf{GSE} & $0.80$ & $0.57$\\
    \textbf{ID4IP} & $0.80$ & $0.83$\\
    \textbf{LR} & $0.53$ & $0.52$ \\
    \bottomrule
  \end{tabular}
  \caption{Performance comparison of the algorithms based on micro-averaged F1 score within a time limit of $120$ seconds}\label{tab:semi-synthetic-exp}
\end{table}

\begin{figure}
    \centering
\includegraphics[scale=0.15]{Figures/modified_sachs.pdf}
    \caption{Modified Sachs causal graph} \label{fig:sachs_causal_graph}
\end{figure}

% \begin{figure}[t!]
% \centering
%  \subfigure[Original Sachs causal graph with the target variable (yellow)]{\label{fig:original_sachs}\includegraphics[scale=0.15]{Figures/original_sach.pdf}}
%  \subfigure[Modified Sachs causal graph with the selection variable (green)]{
% \label{fig:sachs_causal_graph}\includegraphics[scale=0.15]{Figures/modified_sachs.pdf}
%  }
%  \subfigure[Algorithms performance comparison based on micro-averaged F1 score within time limit of $120$ seconds]{
%   \begin{tabular}{c|c|c}
%     \toprule
%     Algorithm & Sachs (11 nodes) & Alarm (37 nodes)\\
%     \midrule
%     \textbf{GSE} & $0.80$ & $0.57$\\
%     \textbf{ID4IP} & $0.80$ & $0.83$\\
%     \textbf{LR} & $0.53$ & $0.52$ \\
%     \bottomrule
%   \end{tabular}}
% \caption{Semi-synthetic experimental results }\label{tab:semi-synthetic-exp}
% \end{figure}
\balance

There are several practical scenarios where the computational demand of the graph surgery estimator is a bottleneck. For example, some researchers investigate the distribution shift problem arising from medical record transfer \cite{agniel2018biases, schrouff2022diagnosing}. Although the causal graph mentioned in \cite{schrouff2022diagnosing} is not large, causal graphs learned from medical data can contain well over 100 nodes \cite{nordon2019building}.  In this experiment, we further demonstrate the use case in which causal graphs are given as a priori information, and the utility of our proposed algorithm for invariant prediction, using causal graphs and semi-synthetic data motivated by real-world settings.  
 \subsubsection{Sachs dataset}
The Sachs dataset measures the expression level of numerous proteins and phospholipids in human cells \cite{sachs2005causal}. It consists of the concurrent measurements of $11$ phosphorylated proteins and phospholipids derived from thousands of individual primary immune system cells. Each variable has $3$ states. We set the target to be the variable \textit{Akt}.  The original causal graph of this dataset is shown in the supplementary material. For the purpose of this experiment, we modified it to the graph shown in Figure \ref{fig:sachs_causal_graph}. We treat the variable \textit{Raf} as the selection variable by converting all of its class $2$ values to class $1$ and using only the samples with \textit{Raf} in class $0$ to train the models. The test sample then consists of all the samples with \textit{Raf} in class $1$. The training sample size is $3632$ and the test sample size is $3368$. We further use $30 \%$ of the training data to validate the models before testing. We also train a logistic regression model (\textbf{LR}) with the predictors \textit{PIP3}, \textit{PIP2}, and \textit{Plcg} only.  We report the micro-averaged F1 score for \textbf{GSE}, \textbf{ID4IP}, and \textbf{LR} in Table \ref{tab:semi-synthetic-exp}. Both \textbf{GSE} and \textbf{ID4IP} outperform \textbf{LR}, as expected, since \textbf{LR} suffers from the distribution shift. We can also see that \textbf{GSE} achieves the same F1 score
as \textbf{ID4IP} since the causal graph is reasonably small. 
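
For reference, the micro-averaged F1 score pools the per-class counts before computing precision and recall. Writing $\mathrm{TP}_k$, $\mathrm{FP}_k$, and $\mathrm{FN}_k$ for the numbers of true positives, false positives, and false negatives of class $k$ (notation introduced here only for illustration), we have
\[
P_{\mathrm{micro}} = \frac{\sum_k \mathrm{TP}_k}{\sum_k (\mathrm{TP}_k + \mathrm{FP}_k)},
\qquad
R_{\mathrm{micro}} = \frac{\sum_k \mathrm{TP}_k}{\sum_k (\mathrm{TP}_k + \mathrm{FN}_k)},
\qquad
F1_{\mathrm{micro}} = \frac{2\, P_{\mathrm{micro}}\, R_{\mathrm{micro}}}{P_{\mathrm{micro}} + R_{\mathrm{micro}}}.
\]
When each sample has exactly one true label and one predicted label, every misclassified sample contributes one false positive and one false negative, so $P_{\mathrm{micro}} = R_{\mathrm{micro}}$ and the micro-averaged F1 score equals the accuracy, that is, one minus the zero-one loss.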
 
  \subsubsection{Alarm dataset}
We also consider a larger causal graph provided by \cite{beinlich1989alarm}, which consists of $37$ observed variables. We modified the graph such that it includes $7$ latent confounders and $28$ observed variables by treating some of its original observed variables as latent. The number of states of each variable varies from $2$ to $4$. We provide the original causal graph and its modified version in the supplementary material. We picked the binary variable \textit{DIFFICULTY} to be the selection variable and \textit{BP} to be the target. All models are trained on $1462$ training samples, while the shifted test sample size is $13538$. \textbf{LR} only uses the features \textit{HISTORY}, \textit{LVEDVOLUME}, \textit{STROKEVOLUME}, \textit{CVP}, and \textit{PCWP}. From Table \ref{tab:semi-synthetic-exp}, we see that \textbf{ID4IP} achieves the highest F1 score among all models. This is because \textbf{GSE} fails to find the best graph surgery estimator within the time limit and \textbf{LR} suffers from the distribution shift. 
%
%

% \begin{table}
%   \caption{Alarm dataset}
%   \label{table:alarm}
%   \small 
%   \centering
%   \begin{tabular}{c|c}
    
%     \toprule  Algorithm     & Micro-averaged F1 score      \\
%     \midrule
%     \textbf{GSE}& $ 0.72 $ \\
%      \textbf{ID4IP}& $0.82 $ \\
%       \textbf{LR}& $0.51 $ \\
%     \bottomrule
%   \end{tabular}
% \end{table}

\section{Conclusion}
We presented an algorithm that efficiently finds predictors invariant to distribution shifts and is guaranteed to find at least one if any exists, for a data-generating process that is correctly modeled by a causal graph. In particular, we utilize a graphical characterization of the identifiability of conditional causal queries together with greedy search to increase the efficiency of finding invariant predictors. Our algorithm is sound and runs in polynomial time, in contrast to existing work that requires exponential time. As shown by the numerical experiments, our proposed algorithm significantly reduces the run time needed to reach predictive performance similar to that of existing work within the given time limits.
In the future, it would be worthwhile to investigate the completeness of the algorithm. Another direction is to understand the approximation guarantees of greedy-search methods for invariant causal prediction. We also want to explore the idea of finding invariant predictors in the presence of weak confounding variables.  
\section{Acknowledgements}
This research has been supported in part by NSF Grant CAREER $2239375$.

%
% References
\clearpage
\bibliographystyle{plain}
\bibliography{lee_773}

\end{document}
