%\documentclass{uai2022} % for initial submission
 \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

% Other packages, macros
\usepackage{hyperref}
\usepackage{amsfonts}
\usepackage{bm}
\usepackage{enumitem} 
\usepackage{comment}
\usepackage{amsthm}
\usepackage{xfrac}
\usepackage{algorithm}
\usepackage{algorithmic}
\newtheorem{theorem}{Theorem}
\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem{prop}{Proposition}
\newtheorem*{prop*}{Proposition}

\input{macros.tex}

\title{Causal Discovery Under a Confounder Blanket}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<david.watson@ucl.ac.uk>?Subject=Your UAI 2022 paper}{David~S.~Watson}{}}
\author[1]{Ricardo~Silva}
% Add affiliations after the authors
\affil[1]{%
    Department of Statistical Science\\
    University College London\\
    London, UK
}
  
  \begin{document}
\maketitle

\begin{abstract}
  Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering the causal order among a subgraph of variables known to descend from some (possibly large) set of confounding covariates, i.e. a \textit{confounder blanket}. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing background information. Under a structural assumption called the \textit{confounder blanket principle}, which we argue is essential for tractable causal discovery in high dimensions, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We present a structure learning algorithm that is provably sound and complete with respect to a so-called \textit{lazy oracle}. We design inference procedures with finite sample error control for linear and nonlinear systems, and demonstrate our approach on a range of simulated and real-world datasets. An accompanying \texttt{R} package, \texttt{cbl}, is available from \texttt{CRAN}.
  %We derive a sound and complete algorithm for learning causal structure under these conditions and implement testing procedures with provable error control for linear and nonlinear systems. We demonstrate our approach on a range of simulated and real-world datasets. Our method is implemented in an easy-to-use $\texttt{R}$ package, $\texttt{cbl}$, available on $\texttt{CRAN}$.
\end{abstract}

\section{Introduction}\label{sec:intro}
Discovering causal relationships between variables is a vital first step in any effort to understand complex systems or design effective interventions. In principle, such relationships can be established through sufficient experimentation; in practice, we must often make do with observational data due to logistical or ethical constraints. Causal discovery algorithms have been in use for decades (see \citet{glymour2019} for a recent review), but the task is notoriously difficult and error-prone, especially in high-dimensional settings. Moreover, for computational or statistical tractability, many of these methods assume \textit{scale-free sparsity}, i.e., that the number of adjacencies for each vertex in the true graph does not grow with the dimensionality of the problem.

\begin{figure}[t!]
    \centering
    \includegraphics[width=\columnwidth]{splash.pdf}
    %\vspace{-5mm}
    \caption{Visual depiction of our setup, which includes a large collection of background variables $\bm{Z}$ (blue nodes) with arbitrary structure, followed by a relatively small set of foreground variables $\bm{X}$ (orange nodes). The goal is to learn causal relationships among $\bm X$ variables by exploiting signals from $\bm{Z}$.}
    %\vspace{-4mm}
    \label{fig:splash}
\end{figure}

In many cases, researchers are interested primarily in the causal relationships between just a subset of observed variables. 
%For instance, the focus in biological studies is often some particular regulatory pathway(s) rather than the entire genome. 
Attempting to learn an entire directed acyclic graph (DAG) in such cases is inefficient and unstable, especially when error rate control is a concern and unmeasured confounders cannot be ruled out.  Suppose, however, that we have access to a large tier of background factors $\bm{Z}$ that may potentially deconfound our target system $\bm{X}$. This stratification could be due to temporal ordering or physical laws. For example, we know that genotypes precede phenotypes, even though it may be impossible to completely characterize the relationship between the two, let alone links among the genotypes themselves. 
%Sometimes, we can make the case that we have access to a large tier of ancestral background factors $\bm{Z}$ that may potentially deconfound a target causal system of interest $\bm{X}$. 
%However, we cannot plausibly apply this reasoning to $\bm{Z}$ itself, i.e. recursively claim that we have access to an even earlier set of background factors to deconfound the deconfounders. 
We argue that many practical problems of interest exhibit such a two-tier structure, with our \emph{foreground} variables $\bm{X}$ causally preceded by some large \emph{background} set $\bm{Z}$, whose internal structure is not relevant or even well-defined. 

%We propose a novel solution specifically designed for {\it targeted} directed acyclic subgraph discovery. 
We propose a novel structure learning algorithm designed for such setups. Our method leverages ``pre-system'' background covariates $\bm{Z}$ to establish causal relationships among foreground variables $\bm{X}$ without making any assumptions about the sparsity of connections between the two tiers. The trade-off is that it will not attempt to discover every possible structural signature that a typical causal discovery method can in theory resolve \citep{Spirtes2000}. Instead, we limit ourselves to what can be derived from the background-foreground interaction. In particular, we posit that background variables can act as a \emph{confounder blanket}, which, as a whole, either blocks unmeasured confounding or not. This amounts to a bet that we can avoid combinatorial search over subsets of $\bm Z$ and still get informative results.
%We show how to exploit differential sparsity patterns between a large set of background variables and a relatively small set of foreground variables to detect edges between nodes and estimate causal effects in the target subDAG. 
%We present a sound and complete algorithm for this task and demonstrate methods for type I error control in linear and nonlinear structural causal models. Experiments confirm that our algorithm reliably recovers information about the targetDAG of interest in a range of simulated and real-world settings. 

Our main contributions are threefold. (1) We derive a sound and complete set of rules for inferring causal order in subgraphs with background variables, as well as identifiability conditions for causal discovery in such settings. Completeness is derived with respect to a so-called \textit{lazy oracle}, which we argue is of greater practical relevance in many settings than the classical independence oracle, especially when we are concerned about statistical and computational feasibility. (2) We design an algorithm that implements these rules with finite sample error control, making a further assumption about how to test for statistical independencies based on regression models. The method is efficient and flexible, avoiding the combinatorial search associated with alternative methods and accommodating both linear and nonlinear systems. (3) We test our approach against a range of alternatives on simulated and real-world data, confirming that the method recovers ancestral relationships in the target subgraph with high power and bounded error. 

%The remainder of this paper is structured as follows. We briefly review relevant background concepts (Sect.~\ref{sec:background}) before formalizing our assumptions and problem statement (Sect.~\ref{sec:problem_statement}). We present a solution based on the iterative application of three simple inference rules, which we show to be sound and complete for causal discovery under a confounder blanket (Sect.~\ref{sec:oracle}). We describe a practical algorithm for finite sample inference (Sect.~\ref{sec:inference}) and demonstrate its performance in comparison to several alternatives on a variety of tasks (Sect.~\ref{sec:exp}). We conclude with a discussion of limitations and directions for future work (Sect.~\ref{sec:discussion}). 

\section{Background and Notation}\label{sec:background}

We assume that causal relationships can be encoded as a DAG $\mathcal{G}$. Each vertex in $\mathcal G$ represents a random variable in a distribution with density/mass function $p(\cdot)$. We make use of the following common terminology in causal discovery: \emph{parent}, \emph{child}, \emph{ancestor}, \emph{descendant}, \emph{mediator}, \emph{collider}, \emph{(active/backdoor) path}, \emph{d-separation}, and \emph{Markov equivalence class}. We omit formal definitions due to space constraints. For details, see \citet{Spirtes2000, Pearl2009}.

%A \emph{path} in $\mathcal G$ is a sequence of edges, connected in the intuitive sense. A \emph{collider} on a path is a vertex which is at the arrow endpoint of two consecutive edges. If $X$ is an \emph{ancestor} of $Y$ in $\mathcal G$, we denote it as $X \in an(Y)$ and $X \prec Y$ or $Y \succ X$. We use $pa(Y)$ to denote the set of \emph{parents} of a vertex, which follow from directed edges such as $X \rightarrow Y$, i.e. $X \in pa(Y)$. If $X \succ Y$ or $Y \prec X$, we say that $X$ is a \emph{descendant} of $Y$, $X \in de(Y)$; and $X$ is a \emph{child} of $Y$, $X \in ch(Y)$, if $Y \rightarrow X$ exists in $\mathcal G$. If $X$ is neither an ancestor nor a descendant of $Y$, then we write $X \sim Y$. We write $X \preceq Y$ if $X$ is \emph{not} a descendant of $Y$, in which case $X$ may or may not be an ancestor of $Y$. Notice that $X\not \prec Y$ implies $Y \preceq X$. Relations $\prec$ and $\preceq$ can be applied to a pair of sets, implying that the relation holds to each pair of elements from the Cartesian product of the respective sets.

%A DAG $\mathcal G$ is a model of conditional independence constraints among sets of random variables represented as vertices in the DAG. Such independencies can be read-off from $\mathcal G$ using a concept called \emph{d-separation} \citep[Ch.~1]{Pearl2009}. 
We use $\bm{X} \dsep \bm{Y}~|~\bm{Z}$ to denote that set $\bm{X}$ is $d$-separated from set $\bm{Y}$ given set $\bm{Z}$ in $\mathcal G$. 
The notation is deliberately similar to that of conditional independence in probability theory \citep{Dawid1979}. 
We stipulate that the joint distribution of the data is Markov with respect to $\mathcal G$, i.e. $\bm{X} \dsep \bm{Y}~|~\bm{Z} \Rightarrow \bm{X} \indep \bm{Y}~|~\bm{Z}$. When the distribution is \textit{faithful} to the graph, the converse holds as well, upgrading the relationship to a biconditional: $\bm{X} \dsep \bm{Y}~|~\bm{Z} \Leftrightarrow \bm{X} \indep \bm{Y}~|~\bm{Z}$.
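
To see what faithfulness rules out, consider a toy linear model of our own (the vertices $V_1, V_2, V_3$ and coefficients $a, b, c$ are purely illustrative) over the DAG with edges $V_1 \rightarrow V_2$, $V_2 \rightarrow V_3$, and $V_1 \rightarrow V_3$, where $V_1, \varepsilon_2, \varepsilon_3$ are mutually independent Gaussians:
\begin{align*}
    V_2 &= a V_1 + \varepsilon_2, \\
    V_3 &= b V_2 + c V_1 + \varepsilon_3, \\
    \mathrm{Cov}(V_1, V_3) &= (ab + c)\,\mathrm{Var}(V_1).
\end{align*}
Since $V_1$ and $V_3$ are adjacent, they are never $d$-separated. Yet if $c = -ab$, the covariance vanishes and, by joint Gaussianity, $V_1 \indep V_3$. The distribution is then Markov but not faithful to the graph. Faithfulness assumes away exact cancellations of this kind.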

Given two vertices $X$ and $Y$ in $\mathcal G$, we use $X \prec Y$ to denote that $X$ is an ancestor of $Y$. Equivalently, $Y \succ X$ denotes that $Y$ is a descendant of $X$. 
%We use $pa(X)$ to denote the set of parents of a vertex. We write $X \sim Y$ if $X$ is neither an ancestor nor a descendant of $Y$. 
We write $X \preceq Y$ if $X$ is \emph{not a descendant} of $Y$, in which case $X$ may or may not be an ancestor of $Y$. (Note that $X\not \prec Y$ implies $Y \preceq X$.)
We write $X \sim Y$ when neither variable is an ancestor of the other, i.e. $X \not \prec Y$ and $Y \not \prec X$. 
In acyclic graphs, the ancestry relation imposes a \textit{strict partial order} characterized by the following three properties:
\begin{itemize}[noitemsep]
    \item \emph{Irreflexivity}: $X \prec X \vdash$ \texttt{FALSE}.
    \item \emph{Asymmetry}: $X \prec Y \vdash Y \not \prec X$. 
    \item \emph{Transitivity}: $X \prec Y \land Y \prec Z \vdash X \prec Z$.
\end{itemize}
Relations $\prec$ and $\preceq$ can be applied to pairs of sets, implying that the relation holds between each pair of elements from the Cartesian product of the respective sets.

Let $\mathbb E[Y~|~do(X = x)]$ denote the expected \emph{outcome} $Y$ under an intervention that fixes the \emph{treatment} $X$ to level $x$. 
%See \citet[Ch.~3]{Pearl2009} for a formal definition of the $do(\cdot)$ operator. 
\emph{Covariate adjustment} postulates that 
\begin{align*}
    \mathbb{E}[Y~|~do(X = x)] = \int \mathbb E[Y~|~x, \bm z] ~p(\bm z)~d\bm z
\end{align*}
for a set of vertices $\bm Z$ in $\mathcal G$. It holds if $\bm Z$ satisfies the \emph{backdoor criterion} with respect to $(X, Y)$ \citep[Ch.~3]{Pearl2009}, in which case we say that $\bm Z$ is a \emph{valid adjustment set} for $(X, Y)$. 
Complete graphical criteria for covariate adjustment can be found in \citet{shpitser2010} and \citet{perkovic2018}.
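
For intuition, suppose (purely as an illustrative assumption, with hypothetical parameters $\alpha, \beta, \bm\gamma$) that $\bm Z$ is a valid adjustment set and the conditional mean is linear, $\mathbb E[Y~|~x, \bm z] = \alpha + \beta x + \bm\gamma^\top \bm z$. The adjustment formula then reduces to
\begin{align*}
    \mathbb{E}[Y~|~do(X = x)] &= \int (\alpha + \beta x + \bm\gamma^\top \bm z) ~p(\bm z)~d\bm z \\
    &= \alpha + \beta x + \bm\gamma^\top \mathbb E[\bm Z],
\end{align*}
so the interventional contrast between two treatment levels $x$ and $x'$ is simply $\beta (x - x')$.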
%Complete graphical criteria for covariate adjustment can be found in \citep{perkovic2018}.
%The $do$-calculus provides a complete set of rules for inferring causal effects from observational data when graphical structure permits \citep{shpitser2008}.

Finally, we define minimal (de)activators, originally highlighted by \citet{Claassen2011}:

\begin{definition}[Minimal activator]
    Variable $D$ is a \emph{minimal activator} of the relationship between $A$ and $B$ given $\bm{C}$ iff: (1) $A \dep B ~|~ \bm{C} \cup D$; and (2) $A \indep B ~|~ \bm{C}$. In this case, we write $A \dep B ~|~ \bm{C} \cup [D]$. $\Box$
\end{definition}

\begin{definition}[Minimal deactivator]
    Variable $D$ is a \emph{minimal deactivator} of the relationship between $A$ and $B$ given $\bm{C}$ iff: (1) $A \indep B ~|~ \bm{C} \cup D$; and (2) $A \dep B ~|~ \bm{C}$. In this case, we write $A \indep B ~|~ \bm{C} \cup [D]$. $\Box$
\end{definition}
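
As a minimal illustration (our own, assuming faithfulness and taking $\bm{C} = \emptyset$), consider two three-vertex DAGs. In the chain $A \rightarrow D \rightarrow B$, $D$ is a minimal deactivator; in the collider $A \rightarrow D \leftarrow B$, $D$ is a minimal activator:
\begin{align*}
    \text{chain:} &\quad A \dep B, \quad A \indep B ~|~ D &&\Rightarrow\quad A \indep B ~|~ [D], \\
    \text{collider:} &\quad A \indep B, \quad A \dep B ~|~ D &&\Rightarrow\quad A \dep B ~|~ [D].
\end{align*}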

\section{Problem Statement}
\label{sec:problem_statement}

Assume a DAG $\mathcal G$ contains observable vertices $\bm Z \cup \bm{X}$, consisting of background variables $\bm{Z}$ and foreground variables $\bm{X}$. Let $|\bm{Z}| = d_Z$ and $|\bm{X}| = d_X$, with potentially $d_Z \gg d_X$. $\mathcal G$ may have unobserved vertices $\bm U$ with more than one descendant in $\bm Z \cup \bm{X}$ (i.e., unmeasured confounders). We construct the latent projection of $\mathcal{G}$ by marginalizing over hidden variables, replacing any path $X_i \leftarrow U_{ij} \rightarrow X_j$ with a bidirected edge $X_i \leftrightarrow X_j$ to form an acyclic directed mixed graph (ADMG) with vertex set $\bm Z \cup \bm{X}$. The symbol $\mathcal G^{\backslash \bm U}$ will denote the ADMG of $\mathcal G$. 
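
For example (a toy graph of our own), let $\mathcal G$ have edges $Z \rightarrow X_1$, $X_1 \rightarrow X_2$, $U \rightarrow X_1$, and $U \rightarrow X_2$, with $U$ unobserved. Marginalizing over $U$ keeps the directed edges among observed vertices and replaces the path $X_1 \leftarrow U \rightarrow X_2$ with a bidirected edge, so that the ADMG has edge set
\begin{align*}
    \{Z \rightarrow X_1, ~X_1 \rightarrow X_2, ~X_1 \leftrightarrow X_2\}.
\end{align*}
The pair $(X_1, X_2)$ is thus joined by both a directed and a bidirected edge in $\mathcal G^{\backslash \bm U}$.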

%Without loss of generality \citep[Sect.~2.6]{Pearl2009}, we assume that each $U \in \bm{U}$ has no parents and exactly two children in $\bm Z \cup \bm{X}$. We may then replace any path $X_i \leftarrow U_{ij} \rightarrow X_j$ with a bidirected edge $X_i \leftrightarrow X_j$, forming an acyclic directed mixed graph (ADMG) with vertex set $\bm Z \cup \bm{X}$. Its Markov properties follow from marginalizing the hidden variables of the corresponding DAG \citep{richardson2003}. The symbol $\mathcal G^{\backslash \bm U}$ will denote the ADMG of $\mathcal G$. 

%Without loss of generality \citep[Sect.~2.6]{Pearl2009}, we assume that each $U \in \bm{U}$ has no parents and exactly two children in $\bm Z \cup \bm{X}$. Interchangeably, we can replace any path $X_i \leftarrow U_{ij} \rightarrow X_j$ with a \emph{bidirected edge} $X_i \leftrightarrow X_j$, forming an \emph{acyclic directed mixed graph} (ADMG) with vertex set $\bm Z \cup \bm{X}$. Its Markov properties follow from marginalizing the hidden variables of the corresponding DAG \citep{richardson2003}. The symbol $\mathcal G^{\backslash \bm U}$ will denote the ADMG of $\mathcal G$. 

The goal is to infer as much as possible about the causal structure of $\mathcal{G}_{X} \subset \mathcal G$, which consists of vertices $\bm{X}$ and the edges with endpoints in $\bm{X}$. We make the following assumptions:
\begin{enumerate}[noitemsep,align=left]
    \item[(A1)] $\mathcal{G}$ is acyclic.
    \item[(A2)] $p(\bm{z}, \bm{x})$ is faithful to $\mathcal{G}^{\backslash \bm U}$.
    \item[(A3)] $\bm{Z}$ contains no descendant of $\bm{X}$ in $\mathcal G$, i.e. $\bm{Z} \preceq \bm{X}$. 
\end{enumerate}
The first assumption can be relaxed---$\mathcal{G}_Z$ may contain cycles under some conditions---but we adopt it here to avoid further technicalities. Faithfulness is a common yet somewhat controversial starting point for many causal discovery procedures (more on this in Sect.~\ref{sec:discussion}). The ordering assumption applies in many settings where background knowledge permits a categorical distinction between upstream and downstream variables, e.g. when data are recorded at different times.

For any pair of variables $X, Y \in \bm{X}$, exactly one of three possibilities obtains: (G1) $X \prec Y$; (G2) $X \succ Y$; or (G3) $X \sim Y$. Our discovery problem is defined as deciding which relationship holds for each pair of vertices in $\mathcal{G}_X$. A similar goal motivates \citet{magliacane2016}, who derive a general algorithm called ancestral causal inference (ACI). ACI does not exploit the background-foreground split and does not scale to high dimensionality. It also comes with no theory about the error control of its inferences. Note that some relationships in $\mathcal{G}_X$ may be only partially identifiable, e.g. if all we can determine is that $X \preceq Y$. Others may be entirely unidentifiable, e.g. if latent confounding is present. 
%If we succeed in learning which structure holds for each variable pair, then we recover the complete subgraph $\mathcal G_X$. In general, however, we are left with ancestral relationships only.

In the next section, we describe a causal discovery algorithm that assumes we have an \emph{oracle} capable of returning exact information about which conditional independencies hold in the population. This is so that we can more easily discuss the limits of what can in principle be discovered from the assumptions provided. In Sect.~\ref{sec:inference}, we present a practical statistical algorithm with error control guarantees.

%The oracle algorithm of the next section is particularly influenced by inference rules originally proposed by \citet{entner2013}. In their setup, we have a single pair of variables $\{X, Y\}$ and assume that $X \preceq Y$. They use background variables $\bm{Z}$ to determine whether $X \rightarrow Y$ or $X \sim Y$. In the former case, they show that causal effects can be estimated via the \textit{backdoor criterion} with some admissible set $\bm{S} \subset \bm{Z}$ \citep[Ch.~3]{Pearl2009}. The structural information we discover can also be used to estimate causal effects by similar methods, but our presentation will focus instead on recovering structural information about $\mathcal G_X$.

%What we do not know is whether a set of given pre-treatment variables $\bm{Z}$ is able to inform us that either (i) $X$ does not cause $Y$; or (ii) some subset of $\bm{Z}$ can be used as an adjustment set for an identification rule called the \emph{backdoor criterion}: that is, whether we can estimate quantitative causal effects on $Y$ of interventions on $X$ by stratifying on and averaging over $\bm{Z}$ --- see \citet{Pearl2009} for details. The structural information we discover can also be used to estimate causal effects by a similar criterion, but for our presentation will concern the task of recovering structural statements about $\mathcal G_X$.

\section{Confounder Blankets and The Oracle Algorithm}
\label{sec:oracle}

There are sound and complete procedures, based on the fast causal inference (FCI) algorithm of \citet{Spirtes2000}, which return all and only those graphs that are Markov equivalent to the true $\mathcal G$ \citep{zhang_complete_fci}.\footnote{FCI returns a partial ancestral graph (PAG), an equivalence class of maximal ancestral graphs (MAGs) \citep{richardson_mags}. By contrast, the PC algorithm, which assumes causal sufficiency, returns a completed partially directed acyclic graph (CPDAG), an equivalence class of DAGs \citep{Spirtes2000}.} 
%By contrast, the PC algorithm, which assumes causal sufficiency, returns a completed partially directed acyclic graph (CPDAG), an equivalence class of DAGs \citep{Spirtes2000}.}
%\footnote{In theory, their output can be refined with background knowledge, such as partial orientation, by simply excluding those members of the equivalence class that are incompatible with the given knowledge.}
Such methods scale poorly with data dimensionality, as they must query for conditional independence over an exponentially increasing number of candidate conditioning sets. 
%Such methods can be computationally very expensive, as they query for conditional independence constraints where the number of candidates for conditioning may explode exponentially as a function of the number of observable variables. 
For tractability, it is sometimes assumed that $\mathcal G$ is sparse or small \citep[e.g.,][]{magliacane2016}. This is unrealistic if we think each element of $\bm X$ should be directly connected to a substantial fraction $\mathcal O(d_Z)$ of background variables, a type of structure taken for granted in most methods that estimate causal effects by covariate adjustment \citep{Hernan2020}. Instead, this work is based on the following principle:

\begin{definition}[The Confounder Blanket Principle, CBP] In the presence of a large set of background variables $\bm Z$, where it is believed that each element of $\bm X$ may be adjacent to $\mathcal O(d_Z)$ elements of $\bm Z$ in $\mathcal G^{\backslash \bm U}$, do not attempt to test for conditional independencies using arbitrary subsets of $\bm Z$. In particular, work under the expectation that if some $\bm A \subset \bm Z \cup \bm X$ is a valid adjustment set for any ordered pair $X_i \prec X_j$, then $\bm A \cup \bm Z$ is also valid. We call a set of background variables with this property a \emph{confounder blanket}. $\Box$
\end{definition}

A failure of CBP does not compromise the \emph{soundness} of the algorithms presented in the sequel, but it may affect their \emph{completeness}. 
In particular, under CBP, we are exposed to the problem of \emph{M-structures} \citep{Pearl2009M}, where some $Z \in \bm{Z}$ is a collider on a path $X_i \leftarrow \dots \rightarrow Z \leftarrow \dots \rightarrow X_j$. Without searching through subsets of $\bm{Z}$, it may be difficult or impossible to infer the causal order of $\mathcal{G}_X$ in this setting.
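
To make the difficulty concrete, consider the smallest such case (a toy example of our own, assuming faithfulness), with hidden variables $U_1, U_2$, a single background variable $Z$, and no other path between $X_i$ and $X_j$:
\begin{align*}
    &X_i \leftarrow U_1 \rightarrow Z \leftarrow U_2 \rightarrow X_j, \\
    &X_i \indep X_j, \qquad X_i \dep X_j ~|~ Z.
\end{align*}
A procedure that always conditions on the whole of $\bm Z$ observes only the dependence, and so cannot certify that $X_i \sim X_j$, whereas a search over subsets of $\bm Z$ would find the empty separating set.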
%In particular, under CBP, we are exposed to the problem of \emph{M-structures} \citep{Pearl2009M}, where a path $X_i \leftarrow \dots \rightarrow Z_k \leftarrow \dots \rightarrow X_j$ is active given $\bm Z$, but no other paths between $X_i$ and $X_j$ are active given (some subset of) $\bm Z\backslash\{Z_k\}$. This means we might miss constraints that would allow us to infer that $X_i \prec X_j$.

M-structures can indeed have a substantive impact on the bias of an adjustment set. However, \citet{Ding2015} have shown that, at least under some reasonable distributions of parameters in some parametric models, their impact may be negligible with high probability and hence statistically hard to detect in a causal discovery method. Instead of proposing yet another derivative of FCI, we believe that practitioners with access to a large set of background variables (which may be required in order to stand a chance against unmeasured confounding) are better served by methods grounded in the CBP. 

\subsection{Structural Signatures and Algorithm}

Our algorithm will be based on the following inference rules, adapted from \citet{entner2013} and \citet{magliacane2016}. 
In what follows, let  $\bm A$ and $\{X, Y\}$ be two sets of observable vertices in a DAG $\mathcal{G}$, where $\bm A \preceq \{X, Y \}$, and let $\bm A_{\backslash W} := \bm A \backslash \{W\}$ for some vertex $W$. 
%In what follows, let  $\bm A$ and $\{X, Y\}$ be two sets of vertices in an ADMG $\mathcal{G}^{\backslash \bm U}$, where $\bm A \preceq \{X, Y \}$, and let $\bm A_{\backslash W} := \bm A \backslash \{W\}$ for some vertex $W$. 
Our first rule detects (indirect) causes via relations of minimal independence:
\begin{enumerate}[align=left]
    \item[(R1)] If $\exists W \in \bm{A}: W \indep Y ~|~ \bm A_{\backslash W} \cup [X]$, then $X \prec Y$.
\end{enumerate}
The soundness of (R1), and (R2) below, follows immediately from Lemma 1 of \citet{magliacane2016}, combined with the partial order $\bm A \preceq \{X, Y\}$. (R1) applies when $X$ \emph{deactivates} all paths from $W$ to $Y$. When this structure obtains, causal effects can be estimated using the backdoor adjustment with admissible sets $\bm A$ and $\bm A_{\backslash W}$.
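
As a minimal illustration of (R1) (a toy graph of our own, assuming faithfulness), take the chain $W \rightarrow X \rightarrow Y$ with $\bm A = \{W\}$, so that $\bm A_{\backslash W} = \emptyset$. Then
\begin{align*}
    W \dep Y, \qquad W \indep Y ~|~ X \quad\Rightarrow\quad W \indep Y ~|~ [X],
\end{align*}
and (R1) returns $X \prec Y$. The rule cannot fire in the reverse direction, since $W$ and $X$ are adjacent and therefore dependent given any conditioning set.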

Our second inference rule eliminates (indirect) causes via relations of minimal dependence:
\begin{enumerate}[align=left]
    \item[(R2)] If $\exists W \in \bm{A}: W \dep X ~|~ \bm A_{\backslash W} \cup [Y]$, then $X \preceq Y$.
\end{enumerate}
(R2) applies when $Y$ \emph{activates} some path from $W$ to $X$. This means that $Y$ must be a (descendant of a) collider on that path, and cannot be a non-collider on any other path active under ${\bm A}_{\backslash W}$.
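
To see why (R2) yields only the weaker relation $\preceq$, consider two toy graphs of our own (assuming faithfulness, with $\bm A = \{W\}$): the collider $W \rightarrow Y \leftarrow X$, and the confounded graph $W \rightarrow Y \leftrightarrow X$. Both imply the same pattern,
\begin{align*}
    W \indep X, \qquad W \dep X ~|~ Y \quad\Rightarrow\quad W \dep X ~|~ [Y],
\end{align*}
yet $X \prec Y$ in the first graph and $X \sim Y$ in the second. All (R2) can conclude is what the two share, namely that $X$ is not a descendant of $Y$.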

Our third rule establishes causal independence via separating sets, and follows immediately from faithfulness:
\begin{enumerate}[align=left]
    \item[(R3)] If $X \indep Y ~|~ \bm{A}$, then $X \sim Y$.
\end{enumerate}
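
The reasoning behind (R3) can be spelled out as follows (our own unpacking of the argument). If $X \prec Y$, there is a directed path from $X$ to $Y$ whose intermediate vertices are all descendants of $X$. Because $\bm A \preceq \{X, Y\}$, none of these vertices lies in $\bm A$, so the path is active given $\bm A$ and faithfulness implies $X \dep Y ~|~ \bm A$. The same holds with the roles of $X$ and $Y$ exchanged. For instance, in the fork $X \leftarrow Z \rightarrow Y$ with $\bm A = \{Z\}$,
\begin{align*}
    X \dep Y, \qquad X \indep Y ~|~ Z \quad\Rightarrow\quad X \sim Y.
\end{align*}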

These building blocks are the basis for our confounder blanket learner (CBL), outlined in Alg.~\ref{alg:cb_oracle}. \textsc{CBL-Oracle} outputs a square, lower triangular ancestrality matrix $\mathbf{M}$, with $\mathbf{M}_{ij}$ representing the partial order between vertices $(X_i, X_j)$. The subscript $\bf M$ on a partial ordering relation indicates that it is already encoded in the matrix, which evolves with each pass through the for loop. The oracle $\mathcal{I}$ is an indicator function over conditional independencies on $p(\bm{z}, \bm{x})$. Note that inferences derived via (R2) are recorded as conjuncts, since they are consistent with multiple structures. The \textsc{Closure} operation, fully articulated in Appx.~A, ensures that $\mathbf{M}$ satisfies the characteristic properties of a strict partial order, thereby reducing conjunctions to their most informative implication.
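
A rough count of oracle calls per pass through the outer loop of Alg.~\ref{alg:cb_oracle} illustrates the savings (this is our own back-of-the-envelope bound read off the pseudocode, not a formal complexity result). Each pass visits at most $\binom{d_X}{2}$ pairs. Each pair requires one call for (R3) and, for each $W \in \bm A$, at most four distinct conditional independence statements: $W$ against $X_i$ and against $X_j$, each with and without the other foreground variable in the conditioning set. Since $|\bm A| \leq d_Z + d_X - 2$,
\begin{align*}
    \#\text{calls per pass} ~\leq~ \binom{d_X}{2} \big(1 + 4(d_Z + d_X - 2)\big) = \mathcal O\big(d_X^2 (d_Z + d_X)\big).
\end{align*}
For instance, with $d_X = 10$ and $d_Z = 100$, the bound is $45 \times 433 = 19{,}485$ calls per pass, whereas $\bm Z$ alone has $2^{100}$ subsets.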

\begin{algorithm}[t]
    \small
   \caption{{\sc CBL-Oracle}}
   \label{alg:cb_oracle}
\begin{algorithmic}
   \STATE {\bfseries Input:} Background set $\bm{Z}$, foreground set $\bm{X}$, oracle $\mathcal I$
   \STATE {\bfseries Output:} Ancestrality matrix $\mathbf{M}$
   \STATE
   \STATE Initialize: $\texttt{converged} \gets \texttt{FALSE}, ~{\mathbf{M}} \gets [{\tt NA}]$

   \WHILE{\textbf{not} $\texttt{converged}$}
        \STATE $\texttt{converged} \gets \texttt{TRUE}$
        \FOR{$X_i, X_j \in \bm{X}$ such that $i>j, ~\mathbf{M}_{ij} = [{\tt NA}]$}
            \STATE $\bm{A} \gets \bm Z \cup \big\{X \in \bm{X} \backslash \{X_i, X_j\} : X \preceq_{\bf M} \{X_i, X_j\} \big\}$
            \IF{$\mathcal I(X_i \indep X_j ~|~ \bm A)$}
                \STATE $\mathbf{M}_{ij} \gets i \sim j$,~
                       $\texttt{converged} \gets \texttt{FALSE}$
                %\STATE \textbf{continue}
            \ELSE
            \FOR{$W \in \bm A$}
                %\STATE $\bm A_{\backslash W} \gets \bm A \backslash \{W\}$
                \IF{$\mathcal I(W \indep X_j ~|~ \bm A_{\backslash W} \cup [X_i])$}
                    \STATE $\mathbf{M}_{ij} \gets i \prec j$,~
                           $\texttt{converged} \gets \texttt{FALSE}$
                \ELSIF{$\mathcal I(W \indep X_i ~|~ \bm A_{\backslash W} \cup [X_j])$}
                    \STATE $\mathbf{M}_{ij} \gets j \prec i$,~
                           $\texttt{converged} \gets \texttt{FALSE}$
                \ELSIF{$\mathcal I(W \dep X_j ~|~ \bm A_{\backslash W} \cup [X_i])$}
                    \STATE $\mathbf{M}_{ij} \gets \mathbf{M}_{ij} \land j \preceq i$,~
                           $\texttt{converged} \gets \texttt{FALSE}$
                \ELSIF{$\mathcal I(W \dep X_i ~|~ \bm A_{\backslash W} \cup [X_j])$}
                    \STATE $\mathbf{M}_{ij} \gets \mathbf{M}_{ij} \land i \preceq j$,~
                           $\texttt{converged} \gets \texttt{FALSE}$
                \ENDIF
            \ENDFOR
            \ENDIF
        \ENDFOR
        %\FOR{$X_i, X_j \in \bm{X}~\text{such that}~\mathbf{M}_{ij}=i \preceq j$}
        %    \FOR{$W \in \bm{A}$}
        %        \IF{$\big( \mathcal{I}(W \dep X_i | \bm{A}_{\backslash W}) \land \mathcal{I}(W \indep X_j | \bm{A}_{\backslash W}) \big) \lor \big( \mathcal{I}(W \indep X_i | \bm{A}_{\backslash W}) \land \mathcal{I}(W \dep X_j | \bm{A}_{\backslash W}) \big)$}
        %            \STATE{$\mathbf{M}_{ij} \gets i \sim j$,~ $\texttt{converged} \gets \texttt{FALSE}$}
        %        \ENDIF
        %    \ENDFOR 
        %\ENDFOR 
        \STATE $\mathbf{M} \gets \textsc{Closure}(\mathbf{M})$
        %\FOR{$i, j, k \in \{1, \dots, d_Z\} ~\text{such that}~ i \neq j \neq k$} 
        %    \IF{$i \prec_{\bf M} j \prec_{\bf M} k \land \mathbf{M}_{ik} \neq i \prec k$}
        %        \STATE $\mathbf{M}_{ik} \gets i \prec k$,~ $\texttt{converged} \gets \texttt{FALSE}$
        %    \ENDIF
        %    \IF{$i \preceq_{\bf M} j \land i \succeq_{\bf M} j \land \mathbf{M}_{ij} \neq i \sim k$}
        %        \STATE $\mathbf{M}_{ij} \gets i \sim k$,~ $\texttt{converged} \gets \texttt{FALSE}$
        %    \ENDIF
        %\ENDFOR
   \ENDWHILE
\end{algorithmic}
\end{algorithm}
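
To illustrate, we trace Alg.~\ref{alg:cb_oracle} by hand on a toy graph of our own, $Z \rightarrow X_1 \rightarrow X_2$, assuming faithfulness, with $\bm Z = \{Z\}$ and $\bm X = \{X_1, X_2\}$:
\begin{itemize}[noitemsep]
    \item The only pair with $i > j$ is $(i, j) = (2, 1)$, for which $\bm A = \{Z\}$.
    \item $\mathcal I(X_2 \indep X_1 ~|~ Z)$ is false, since $X_1$ and $X_2$ are adjacent, so (R3) does not apply.
    \item For $W = Z$, we have $\bm A_{\backslash W} = \emptyset$. The first test, $\mathcal I(Z \indep X_1 ~|~ [X_2])$, fails because $Z$ and $X_1$ are adjacent.
    \item The second test, $\mathcal I(Z \indep X_2 ~|~ [X_1])$, succeeds: $Z \dep X_2$ marginally, but $Z \indep X_2 ~|~ X_1$. Hence $\mathbf{M}_{21} \gets 1 \prec 2$.
    \item On the next pass no entry equals \texttt{NA}, so $\texttt{converged}$ remains \texttt{TRUE} and the loop exits.
\end{itemize}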


\subsection{Properties}
Proofs of all theorems are given in Appx.~A. 
\begin{theorem}[Soundness]\label{thm:soundness}
    All ancestral relationships returned by {\sc CBL-Oracle} hold in the true $\mathcal G_X$. Moreover, if $\mathbf{M}_{ij} = i \prec j$, then the set of shared non-descendants $\bm{A} = \bm{Z} \cup \big\{X \in \bm{X} \backslash \{X_i, X_j\}: X \preceq_{\bf M} \{X_i, X_j\}\big\}$ is a valid adjustment set for $(X_i, X_j)$.
\end{theorem}
By design, {\sc CBL-Oracle} can be uninformative where a method like FCI will provide a causal order. One of the simplest examples is the so-called \emph{Y-structure} \citep{mani2006}, $\{X_1 \rightarrow X_3, X_2 \rightarrow X_3, X_3 \rightarrow X_4\}$, where FCI discovers $X_3 \rightarrow X_4$. By contrast, with an empty $\bm Z$, {\sc CBL-Oracle} cannot infer any causes (though it may still infer $X_1 \sim X_2$ via (R3)). However, the presence of a single edge from a background variable $Z$ into $X_1, X_2$, or $X_3$ will allow for the discovery that $X_3 \prec X_4$, while an edge from $Z$ into $X_4$ will allow for the discovery that $X_3 \preceq X_4$. 

We characterize full identifiability conditions for Alg.~\ref{alg:cb_oracle} as follows, with ${\bm X}_{\preceq i}$ denoting the set of all observable non-descendants of $X_i$ in $\mathcal G$, including $X_i$ itself. Without loss of generality, assume that variables are indexed such that no $X_i$ is a descendant of any $X_j$ with $j > i$.
%Given the limited scope of the oracle, it is also good to characterize a plausible and motivating situation where we can find the full causal ordering in $\mathcal G_X$. In what follows, let ${\bm X}_{\preceq i}$ be the set non-descendants of $X_i$ in $G_X$.

\begin{theorem}[Identifiability]
\label{thm:order_identify}
    The following conditions are sufficient for $\textsc{CBL-Oracle}$ to learn the total causal order of $\mathcal{G}_X$. 
    If $X_i \sim X_j$, then either (i) there is no active backdoor path between $X_i$ and $X_j$ given $\bm Z$ and their common ancestors in $\bm X$; or (ii) some $V_i \in {\bm X_{\preceq i}}$ is $d$-connected to $X_j$ given ${\bm X_{\preceq i}} \backslash \{V_i\}$, and some $V_j \in {\bm X_{\preceq j}}$ is $d$-connected to $X_i$ given ${\bm X_{\preceq j}} \backslash \{V_j\}$.
    If $X_i \prec X_j$, then either (iii) some $V \in {\bm X_{\preceq i}}$ is $d$-separated from $X_j$ given ${\bm X_{\preceq i}} \backslash \{V\}$; or (iv) there exists a nonempty set of mediators along a unidirectional path $X_i \prec X_{i+1} \prec \dots \prec X_{j-1} \prec X_j$ such that condition (iii) applies to each pair $\{X_k, X_{k+1}\}, k \in \{i, \dots, j-1\}$. 
\end{theorem}

% (iii) for each pair $\{X_i, X_j\}$ such that $X_i \preceq X_j$ in $\mathcal G$, $X_j$ has at least one adjacent vertex $V \in \bm{X}_{\preceq j}$ in $\mathcal G^{\backslash \bm U}$ that is not $d$-connected to $X_i$ given ${\bm X_{\preceq j}} \backslash \{V\}$.
Condition (i) above motivates the name \emph{confounder blanket}. 
%Note that some structures that cannot be learned via (R1)-(R3) may still be inferred using the transitivity and antisymmetry of ancestral relations (thence the $\textsc{Closure}$ function). See Appx. \ref{app:proofs} for details.

One of the key points, however, is the \emph{completeness} of our algorithm. In the nonparametric causal discovery literature, this is usually defined with respect to an oracle that delivers true answers to all conditional independence queries over observable variables. We define a new scope for completeness that places some reasonable limits on oracular omnipotence. First, we introduce the following definitions:

\begin{definition}[Iteration-$t$ known non-descendant]
Given an algorithm $\mathcal A$, we call vertex $W$ an \emph{iteration-$t$ known non-descendant} of a vertex $X$ if either (i) $W \in \bm Z$; or (ii) after $t$ modifications to $\mathbf{M}$ by $\mathcal A$, we have $W \preceq_{\mathbf{M}} X$. $\Box$
\end{definition}
%(i) $W \in \bm Z$; or (ii) after $t$ modifications to $\mathbf{M}$ by $\mathcal A$, the algorithm has deduced that $W \preceq X$. $\Box$


\begin{definition}[Lazy oracle algorithm]
Let $\bm{X}_{\preceq i}^t$ be the set of all iteration-$t$ known non-descendants of $X_i$ according to some algorithm $\mathcal A$. A \textit{lazy oracle algorithm} is one that starts with an uninformative ancestrality matrix $\mathbf{M}$ and updates at each round $t$ with answers to queries of just two types:
\begin{itemize}[noitemsep]
    \item [(i)] $W \indep X_i~|~{\bm S}_{ij\backslash W}^t \cup \phi(X_j)$, such that $W \in {\bm S}_{ij}^t$ and $\phi(X_j) \in \big\{\emptyset, \{X_j\}\big\}$; and
    \item[(ii)] $X_i \indep X_j~|~{\bm S}_{ij}^t$,
\end{itemize}
where $\{X_i, X_j\} \subseteq {\bm X}$ and ${\bm S}_{ij}^t := \bm{X}_{\preceq i}^t \cap \bm{X}_{\preceq j}^t$. $\Box$
\end{definition}
Our oracle may be clairvoyant when it comes to probabilistic relationships, but she is not quite as accommodating as her classical counterpart. 
In particular, she refuses to marshal her powers in service of combinatorial search strategies, which she considers tedious and inelegant.
%In particular, she cannot be bothered to marshal her powers in service of combinatorial search strategies, which she considers tedious and inelegant. 
Instead she bestows her favor upon us only when we limit ourselves to a more restrictive class of queries pertaining to independence relationships conditioned on the complete set of (known) non-descendants for any given pair of foreground variables.
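
To illustrate the definition, consider the first round under the assumption (made here only for exposition) that the uninformative $\mathbf{M}$ encodes no $\preceq$ relations among foreground variables. The only known non-descendants are then the background variables, so ${\bm S}_{ij}^0 = \bm{Z}$ for every pair $\{X_i, X_j\}$, and the admissible queries reduce to
\begin{equation*}
    \begin{aligned}
    \text{(i)} &~~ W \indep X_i~|~\bm{Z}_{\backslash W} \quad \text{and} \quad W \indep X_i~|~\bm{Z}_{\backslash W} \cup \{X_j\}, \quad W \in \bm{Z}; \\
    \text{(ii)} &~~ X_i \indep X_j~|~\bm{Z}.
    \end{aligned}
\end{equation*}
If a later round $t$ establishes $X_k \preceq_{\mathbf{M}} X_i$ and $X_k \preceq_{\mathbf{M}} X_j$ for some third foreground variable $X_k$, then $X_k \in {\bm S}_{ij}^t$. From that round on, $X_k$ either enters the conditioning set of each query about the pair or serves as the candidate $W$.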

Observe that inferences about ancestral relationships are fully ordered with respect to their information content: $\{\texttt{NA}\} \prec \{i \preceq j\} \prec \{i \prec j\} \sim \{i \sim j\}$. This motivates the following optimality target:

\begin{definition}[Dominance]
Among the set of all sound procedures for learning ancestral relationships, we say that algorithm $\mathcal{A}$ \textit{dominates} algorithm $\mathcal{B}$ iff $\mathcal{A}$ is strictly more informative than $\mathcal{B}$. That is, (i) there exists no pair of observable vertices in any DAG $\mathcal{G}$ such that $\mathcal{A}$'s output for that pair is less informative than $\mathcal{B}$'s; and (ii) there exists some pair of observable vertices in some DAG $\mathcal{G}$ such that $\mathcal{A}$'s output for that pair is more informative than $\mathcal{B}$'s. $\Box$
\end{definition}

Finally, we may state our completeness result.

\begin{theorem}[Completeness]\label{thm:completeness}
    No lazy oracle algorithm dominates {\sc CBL-Oracle}. That is, inferences returned by {\sc CBL-Oracle} are always at least as informative as those of any lazy oracle algorithm.
\end{theorem}

An immediate corollary of Thm.~\ref{thm:completeness} is that the identifiability conditions of Thm.~\ref{thm:order_identify} are not just sufficient but also \textit{necessary} with respect to a lazy oracle algorithm.

Of course, relationships of conditional independence are estimated from finite samples in practice. In the sequel, we consider practical methods for implementing an algorithm that is pointwise consistent under further assumptions about the nature of conditional independencies in $p({\bm z}, {\bm x})$.

\begin{comment}
\begin{definition}[Full-$\preceq$ algorithm]
Let $X_{\preceq i}^t$ be the set of all iteration-$t$ known non-descendants of $X_i$ according to some algorithm $\mathcal A$. A \emph{full}-$\preceq$ \textit{algorithm} $\mathcal A$ is one that starts with an uninformative ancestrality matrix $\mathbf{M}$ and updates at each round $t$ with oracle answers to queries of just two types:
\begin{itemize}
    \item [(i)] $W \indep X_i~|~{\bm S}_{ij\backslash \{W\}}^t \cup \phi(X_j)$, such that $W \in {\bm S}_{ij}^t$ and $\phi(X_j) \in \{\emptyset, \{X_j\}\}$; and
    \item[(ii)] $X_i \indep X_j~|~{\bm S}_{ij}^t$,
\end{itemize}
where $\{X_i, X_j\} \subseteq {\bm X}$ and ${\bm S}_{ij}^t := X_{\preceq i}^t \cap X_{\preceq j}^t$. $\Box$
\end{definition}

In other words, this class of algorithms can only use conditioning sets that contain \emph{all} known non-descendants of a given pair of foreground variables. Within this class of tractable algorithms, our proposed method has the following property.

\begin{theorem}[Completeness]\label{thm:completeness}
    Let $\mathcal A$ be any sound \emph{full}-$\preceq$ algorithm. If {\sc ConfounderBlanketLearner} returns {\tt NA} for some ${\mathbf{M}}_{ij}$, so will $\mathcal A$. If it returns $i \preceq j$ or $j \preceq i$, algorithm $\mathcal A$ will return the same (or possibly {\tt NA}).
\end{theorem}
\end{comment}

\section{Statistical Inference}\label{sec:inference}

In this section, we describe a practical method based on the oracle algorithm, called $\textsc{CBL-Sample}$, or simply CBL. Our main assumption to help bridge the gap between theory and practice is the following:
\begin{enumerate}[align=left]
    \item[(A4)] We have access to a regression algorithm by which we can test any pairwise conditional independence statement $X \indep Y~|~{\bm S}$ by regressing $Y$ on ${\bm S} \cup \{X\}$. The regression is implemented with a variable selection strategy which will, in the limit of infinite data, remove $X$ from the regression equation iff $X \indep Y~|~{\bm S}$.
\end{enumerate}

Statistical error control techniques are presented under this assumption. We do not discuss its validity for the specific sparse regression engines used here. The assumption is well understood in, for example, the case of Gaussian linear regression with a $z$-test of the coefficient for $X$. In other scenarios, for computational or statistical reasons, matters are less straightforward (e.g., lasso is ``sparsistent'' only under restrictive assumptions, and a covariate may drop out of a population regression function even if the corresponding conditional independence does not hold \citep{hastie2015}). Instead, we take this foundational assumption as an idealization that simplifies analysis, acknowledging that, in practice, it may be only approximately satisfied.

% What about (non-asymptotic) high probability guarantees instead of consistency? Maybe there's another way to lean on modularity that does not put the focus on how unrealistic the assumptions might be for any particular theorems about lasso or whatever method it is

Constraints like $W \indep X_j~|~{\bm A}_{\backslash W} \cup [X_i]$ suggest two regression models per triplet $(X_i, X_j, W)$: one for the regression of $X_j$ on $W$, ${\bm A}_{\backslash W}$ and $X_i$, and another for the regression of $X_j$ on $W$ and ${\bm A}_{\backslash W}$ only. In the algorithm that follows, we simplify this by using a single model to simultaneously test for all $W$, fitting a regression for $X_j$ on ${\bm A}$ and $X_i$, and another regression for $X_j$ on ${\bm A}$ only. These are clearly mathematically equivalent (as ${\bm A} = {\bm A}_{\backslash W} \cup \{W\}$), so long as the variable selection procedure in the regression model can be computed exactly, for instance when using $z$-tests for a Gaussian regression model or when lasso sparsistency conditions are satisfied. This will not necessarily be the case when an intractable combinatorial search underlies variable selection, or when conditions for a continuous relaxation do not hold. The safer alternative is, just like in the oracle algorithm, to perform individualized model selection for each $W$, without any concern for simultaneously selecting variables within ${\bm A}_{\backslash W}$. 

Nevertheless, for simplicity we rely on a joint variable selection procedure that uses all elements of $\bm{A}$ when fitting each regression model, and empirically show that bundling individual covariate tests achieves better results than existing alternatives. We emphasize that combinatorial search can be avoided altogether by separating selection on each $W$ from any sort of sparse regularization or search among the other covariates, if so desired.

\paragraph{Bipartite Subgraphs.}

We begin with the simplest case, in which we have just two foreground variables $\bm{X} = \{X, Y\}$. We fit a quartet of models to estimate the following conditional expectations: 
\begin{equation*}
    \begin{aligned}
    f^0_Y: &~\mathbb{E}[Y~|~\bm{Z}] \\
    f^1_Y: &~\mathbb{E}[Y~|~\bm{Z}, X]
    \end{aligned} \quad 
    \begin{aligned}
    f^0_X: &~\mathbb{E}[X~|~\bm{Z}] \\
    f^1_X: &~\mathbb{E}[X~|~\bm{Z}, Y],
    \end{aligned}
\end{equation*}
where subscripts index outcome variables and superscripts differentiate between full and reduced conditioning sets. Assume, for concreteness, that all structural equations are linear. Since some elements of $\bm{Z}$ may not influence $\bm{X}$, we estimate the members of this quartet using lasso regression, which performs automatic feature selection. This results in four different \emph{active sets} of predictors. For instance, the active set $\hat{\bm{S}}^0_Y(\lambda) \subseteq \bm{Z}$ picks out just those background variables that receive nonzero weight in the model $\hat{f}^0_Y$ at a given value of the regularization parameter $\lambda$ (though we generally suppress the dependence for notational convenience).
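
For concreteness, one standard way to formalize these active sets is sketched below. The penalized least-squares objective, the coefficient vector $\hat{\bm{\beta}}^0_Y$, and the sample notation $(y_i, \bm{z}_i)_{i=1}^n$ are introduced here purely for illustration, and intercepts are omitted:
\begin{equation*}
    \begin{aligned}
    \hat{\bm{\beta}}^0_Y(\lambda) &\in \operatorname*{arg\,min}_{\bm{\beta} \in \mathbb{R}^{d_Z}} ~\frac{1}{2n} \sum_{i=1}^n \big(y_i - \bm{\beta}^\top \bm{z}_i\big)^2 + \lambda \|\bm{\beta}\|_1, \\
    \hat{\bm{S}}^0_Y(\lambda) &:= \big\{Z_k \in \bm{Z}: \hat{\beta}^0_{Y,k}(\lambda) \neq 0\big\}.
    \end{aligned}
\end{equation*}
The set $\hat{\bm{S}}^1_Y(\lambda)$ is obtained in the same way after appending $X$ to the predictors, so that $\hat{\bm{S}}^1_Y(\lambda) \subseteq \bm{Z} \cup \{X\}$, and likewise for $\hat{\bm{S}}^0_X$ and $\hat{\bm{S}}^1_X$ with the roles of $X$ and $Y$ exchanged.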

Our basic strategy is to refit the model quartet some large number of times $B$, taking different training/validation splits to get a sampling distribution over active sets. (The exact resampling method is described in more detail below.) This allows us to test the antecedent of (R3) by evaluating whether $X \in \hat{\bm{S}}^1_Y$ and $Y \in \hat{\bm{S}}^1_X$ with sufficient frequency. If the conjunction occurs fewer than $\gamma B$ times (with the convention that $\gamma = \sfrac{1}{2}$) we conclude that $X \sim Y$. Because we seek to minimize errors of commission, we are more conservative in our inference procedures for $\prec$ and $\preceq$ relations. From our distribution of active sets we calculate the (de)activation rate of each non-descendant with respect to a given causal ordering. This gives four unique rates per non-descendant, representing the (de)activation frequencies when treating either $X$ or $Y$ as the candidate cause. High rates are evidence that the corresponding inference rule applies.
%\begin{equation}
%    r_d(Z)_{X \preceq Y} := \#\{b: Z \in \bm{S}^{0(b)}_Y \land Z %\not\in \bm{S}^{1(b)}_Y\}/B.
%\end{equation}
%The activation rate can be similarly defined:
%\begin{equation}
%    r_a(Z)_{X \preceq Y} := \#\{b: Z \not\in \bm{S}^{0(b)}_X \land Z %\in \bm{S}^{1(b)}_X\}/B.
%\end{equation}

What is a reasonable threshold for drawing such an inference? It is not immediately obvious how to specify an expected null (de)activation rate without further assumptions on the data generating process. Rather than introduce some ad-hoc prior or sparsity constraint, we take an adaptive approach inspired by the stability selection procedure of \citet{Meinshausen2010}. Specifically, we use a variant of complementary pairs stability selection \citep{Shah2013}, which guarantees an upper bound on the probability of falsely selecting a low-rate feature at any given threshold $\tau$. The method is so named because, on each draw $b$, we partition the data into disjoint sets of equal size. Rates are estimated over all $2B$ subsamples.

Stability selection was originally conceived for controlling error rates in feature selection problems, primarily lasso regression. We adapt the procedure to accommodate our modified target, which is a conjunction of inclusion/exclusion statements rather than a single selection event. Specifically, we are interested in the probability of (de)activation under some fixed feature selection procedure $\hat{\bm{S}}$. We write:
\begin{align*}
    r_d(Z)_{X \preceq Y} &:= \mathbb{P}(Z \in \hat{\bm{S}}^0_Y \land Z \not\in \hat{\bm{S}}^1_Y)
\end{align*}
to denote the probability that feature $Z$ is deactivated w.r.t. $X \preceq Y$. Activation rates are analogously defined:
\begin{align*}
    r_a(Z)_{X \preceq Y} &:= \mathbb{P}(Z \not\in \hat{\bm{S}}^0_X \land Z \in \hat{\bm{S}}^1_X).
\end{align*}
For the opposite ordering, we simply swap active sets, using $\hat{\bm{S}}^0_X, \hat{\bm{S}}^1_X$ for deactivation, and $\hat{\bm{S}}^0_Y, \hat{\bm{S}}^1_Y$ for activation w.r.t. $X \succeq Y$.

\begin{definition}[Complementary pairs stability selection] Let $\{(\mathcal{D}_{2b-1}, \mathcal{D}_{2b}) \subseteq [n]: b \in [B]\}$ be randomly chosen independent pairs
% Josh: this independence is for different b's, yes? could be confusing otherwise
of sample subsets of size $\lfloor n/2 \rfloor$ such that $\mathcal{D}_{2b-1} \cap \mathcal{D}_{2b} = \emptyset$. For $\tau \in [0, 1]$, $\phi \in \{a, d\}$, and $\psi \in \{X \preceq Y, X \succeq Y\}$, the complementary pairs stability selection (CPSS) procedure is $\hat{\bm{H}}_{\tau, \phi, \psi} := \{k: \hat{r}_\phi(Z_k)_\psi \geq \tau\}$, with estimated rates given by:
\begin{equation*}
    \hat{r}_d(Z)_{X \preceq Y} := \#\{b: Z \in \hat{\bm{S}}^0_Y(\mathcal{D}_b) \land Z \not\in \hat{\bm{S}}^1_Y(\mathcal{D}_b)\}/2B 
\end{equation*}
% Josh: I find the \land notation weird to parse, could just be me though
for deactivation w.r.t. $X \preceq Y$, and 
\begin{equation*}
    \hat{r}_a(Z)_{X \preceq Y} := \#\{b: Z \not\in \hat{\bm{S}}^0_X(\mathcal{D}_b) \land Z \in \hat{\bm{S}}^1_X(\mathcal{D}_b)\}/2B
\end{equation*}
for activation w.r.t. the same ordering. Again, to estimate rates for the opposite ordering, we simply swap active sets as described above. $\Box$
%To calculate rates for the opposite ordering, we simply swap active sets for the two numerators, using $\hat{\bm{S}}^0_X$ and $\hat{\bm{S}}^1_X$ for deactivation, and $\hat{\bm{S}}^0_Y$ and $\hat{\bm{S}}^1_Y$ for activation w.r.t. $X \succeq Y$. 
\end{definition}
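As a small worked example, with numbers chosen purely for exposition rather than taken from any experiment, suppose $B = 50$, so that rates are estimated over $2B = 100$ subsamples. If some $Z_k$ belongs to $\hat{\bm{S}}^0_Y$ but not to $\hat{\bm{S}}^1_Y$ in 62 of these subsamples, then
\begin{equation*}
    \hat{r}_d(Z_k)_{X \preceq Y} = \sfrac{62}{100} = 0.62,
\end{equation*}
and hence $k \in \hat{\bm{H}}_{\tau, d, X \preceq Y}$ for any threshold $\tau \leq 0.62$.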
% Josh: I must be wrong that there's some trivial symmetry in the activation/deactivation rates, like d = 1 - a. But it's not obvious to me why that's wrong. Is that because I am stupid and didn't read something earlier, or might it be a useful remark?
For some $\theta < \tau$, let $\bm{L}_{\theta, \phi, \psi} := \{k: r_\phi(Z_k)_\psi \leq \theta\}$ denote the set of low-rate variable indices for some $\phi, \psi$. Our goal is to bound the expected number of low-rate features selected at a given threshold $\tau$, i.e. $\mathbb{E}[|\hat{\bm{H}}_{\tau, \phi, \psi} \cap \bm{L}_{\theta, \phi, \psi}|]$. Methods for doing so rely on certain assumptions about the distribution of rates for features within $\bm{L}_{\theta, \phi, \psi}$. \citet{Shah2013}'s tightest bound is achieved under $r$-concavity, formally defined in Appx. B. Roughly, $r$-concave distributions describe a continuum of constraints that interpolate between unimodality and log-concavity for $r \in [-\infty, 0]$. Simulation results suggest that (de)activation rates for low-rate features exhibit the following property (see Appx. B):
\begin{enumerate}[align=left]
    \item[(A5)] For all $Z \in \bm{L}_{\theta, \phi, \psi}$, empirical rates $\hat{r}_\phi(Z)_\psi$ are approximately $-1/4$-concave.
\end{enumerate}
We now have the following error control guarantee:
\begin{theorem}[Error control] \label{thm:err}
    The expected number of low-rate features selected by the CPSS procedure is bounded from above:
    \begin{align*}\label{eq:err}
    \begin{split}
        \mathbb{E}[|\hat{\bm{H}}_{\tau, \phi, \psi} \cap \bm{L}_{\theta, \phi, \psi}|] \leq \min\{D(\theta^2, 2\tau-1, B, -1/2), \\ D(\theta, \tau, 2B, -1/4)\}|\bm{L}_{\theta, \phi, \psi}|,
    \end{split}
    \end{align*}
    where $D(\theta, \tau, B, r)$ is the maximum of $\mathbb{P}(X \geq \tau)$ over all $r$-concave random variables supported on $\{0, \sfrac{1}{B}, \sfrac{2}{B}, \dots, 1\}$ with $\mathbb{E}[X] \leq \theta$. 
\end{theorem}
This is a direct application of \citet{Shah2013}'s Eq. 8. Though the bound is valid for all $\tau \in (\theta, 1]$, we apply an adaptive lower bound $\epsilon > \theta$ on $\tau$, which denotes the minimum rate such that no conflicting inferences emerge, e.g., different ancestors deactivating for opposite causal orderings (see Alg. 5, Appx. C). We follow the authors' recommendations for default values of $B$ and $\theta$ (see Appx. B). We note that there is no closed-form solution for $D(\theta, \tau, B, r)$, but the quantity is easily computed with numerical methods. If the number of (de)activations detected via this procedure exceeds the maximum error bound of Thm.~\ref{thm:err}, we infer that at least one must be a true positive. 
%For instance, if we find that 5 features are deactivated w.r.t. $X \preceq Y$ at $\tau = 0.6$, while the r.h.s. of Eq.~\ref{eq:err} is just 4, then at least one deactivation must be a true positive. 
Since even a single (de)activation is sufficient to partially order variable pairs, this licenses the corresponding inference.
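
For instance, with numbers chosen purely for illustration, suppose that 5 features are deactivated w.r.t. $X \preceq Y$ at $\tau = 0.6$, while the right-hand side of the bound in Thm.~\ref{thm:err} evaluates to 4:
\begin{equation*}
    |\hat{\bm{H}}_{0.6, d, X \preceq Y}| = 5 > 4 \geq \mathbb{E}[|\hat{\bm{H}}_{0.6, d, X \preceq Y} \cap \bm{L}_{\theta, d, X \preceq Y}|].
\end{equation*}
Under the decision rule described here, at least one of the five deactivations is then treated as a true positive. Note that the bound concerns an expected count, so this reading is a heuristic decision rule rather than a deterministic guarantee for any single dataset.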

% Josh: and that inference is? Something like "rejecting some H in favor of some other H at level tau controls this error rate (defined using the first H?)"? From one of our discussions I recall there are more than 2 hypotheses. If this isn't spelled out here I think another reader may not know how to interpret what exactly the inference is

\paragraph{General case.}

Our method can be expanded to accommodate larger sets of foreground variables and nonlinear structural equations. When $d_X > 2$, we simply
% Josh: lol "simply" --- true, but yet... (sorry I can't resist being a scold about certain math-speak conventions)
loop through all $d_X(d_X - 1)/2$ unique pairs of variables and record any inferences made at time $t=1$. As in the oracle algorithm, as the set $\bm A$ grows for $t > 1$, we continue cycling through pairs that have yet to be unambiguously decided until no further inferences are forthcoming. Though we use lasso regression for linear systems in our experiments, stepwise regression or even best subset selection may be viable alternatives \citep{Hastie2020}. For nonlinear systems, we use gradient boosted regression trees with early stopping, which automatically adapt to signal sparsity \citep{friedman_gbm, buhlmann_l2boost}. Any function $s: \mathbb{R}^{d} \times \mathbb{R} \to 2^{[d]}$ from input variables and outcome to an active set of predictors will suffice. Such feature selection subroutines may be consistent estimators for the Markov blanket of a given variable under fairly minimal conditions \citep[see, e.g.,][Prop. 1]{Candes2018}. 

In the worst case, CBL requires $\mathcal{O}(Bd_X^3)$ calls to the feature selection subroutine $s$, the complexity of which itself presumably depends on $n, d_Z$ and $d_X$. For example, with $n > d = d_Z + d_X$, the least angle regression implementation of lasso takes $\mathcal{O}(d^3 + nd^2)$ computations \citep{efron_lars}, resulting in overall complexity of $\mathcal{O}\big(B(d_X^6 + nd_X^5 + d_Z^3)\big)$. More generally, if $s$ executes in polynomial time, then CBL also runs in polynomial time. Since constraint-based graphical learning without sparsity restrictions is NP-hard \citep{chickering2004}, this represents a major computational improvement. The procedure can be further sped up by parallelizing over subsamples, as these are independent. For pseudocode summarizing $\textsc{CBL-Sample}$, see Alg. 4 in Appx. C. 

\begin{comment}
\begin{align*}
    &\sum_{t=1}^{d} (d-t+1)(d-t)/2 \\
    &\sum_{t=1}^{d} (d^2 + t^2 - 2dt + d - t)/2 \\
    (Bd^3_X)(d^3 + nd^2) = Bd^6_X + (3d_x + n)Bd^5_X + (3+2n)Bd^4_xd_Z + (n+d_z)Bd^3_Xd^2_Z
\end{align*}
\end{comment}

\section{Experiments}\label{sec:exp}

\begin{figure*}[t!]
    \centering
    \includegraphics[width=0.75\textwidth]{bivariate.pdf}
    \caption{Simulation results at varying sample sizes for three different structures: (a) $X \rightarrow Y$; (b) $X \dsep Y ~|~ \bm{S}$; and (c) $X \dsep Y ~|~ \bm{S} \cup [\bm{U}]$. We compare our CBL method to constraint- and score-based benchmarks. Expected results of an independence oracle are included at the far right.}
    %\vspace{-3mm}
    \label{fig:biv}
\end{figure*}

Full details of our simulation experiments are described in Appx. D. Briefly, we vary the sample size and dimensionality of the data, as well as graph structure and sparsity. Linear and nonlinear structural equations are applied at a range of different signal-to-noise ratios (SNRs).  

%\subsection{Bipartite Subgraphs}
\paragraph{Bipartite subgraphs.} We benchmark against a constraint-based method proposed by \citet{entner2013} and a score-based alternative similar in spirit to many causal discovery algorithms. We highlight two key differences between our proposal and \citet{entner2013}'s: (1) Their method assumes a partial order on foreground variables upfront. With the prior knowledge that $X \preceq Y$, it tests whether $X \rightarrow Y$ or $X \sim Y$, with the possibility that the disjunction is undecidable from the observational distribution. It therefore has an advantage in the following experiment, where the partial ordering assumption is satisfied, but competitors still consider the possibility that $X \leftarrow Y$. (2) The original version of \citet{entner2013}'s method performs combinatorial search through the space of non-descendants, which is infeasible in our setting. Following the authors' advice, we simplify the procedure by sampling random variable-subset pairs from $\bm{Z}$, evaluating conditional independence either via partial correlation (for linear data) or the generalized covariance measure \citep{shah2020} with gradient boosting subroutine (for nonlinear data).

%The former requires a partial order on foreground variables. The original method also performs combinatorial search through the space of non-descendants, which is infeasible in our setting. Following the authors' advice, we simplify the procedure by sampling random variable-subset pairs from the set of non-descendants, evaluating conditional independence either via partial correlation (for linear data) or the generalized covariance measure \citep{shah2020} with gradient boosting subroutine (for nonlinear data). Note that, while this method may return $\texttt{NA}$ if decision thresholds are not met, Entner et al.'s method can never infer $X \leftarrow Y$, since it requires the partial order $X \preceq Y$ as an input. 

For our score-based benchmark, we train a series of models to evaluate three different structural hypotheses, corresponding to (G1) $X \rightarrow Y$; (G2) $X \leftarrow Y$; and (G3) $X \sim Y$. We use lasso for linear data and gradient boosting for nonlinear data. We calculate the proportion of variance explained on a test set for all settings. If (G3) scores highest, we return $X \sim Y$. Otherwise, we test whether out-of-sample residuals for the top scoring model are correlated with the foreground predictor. If so, we return $\texttt{NA}$; if not, we return whichever of (G1) or (G2) scored highest.

We visualize results for the setting with 100 background variables and expected sparsity $\sfrac{1}{2}$ (see Fig.~\ref{fig:biv}). 
%, where sparsity is defined as the proportion of $\bm{Z}$-nodes with no edge into $\mathcal{G}_X$
Data are simulated from 100 random graphs drawn under three different structural constraints: (a) $X \rightarrow Y$; (b) $X \dsep Y ~|~ \bm{S}$, for some $\bm{S} \subseteq \bm{Z}$; and (c) $X \dsep Y ~|~ \bm{S} \cup [\bm{U}]$, where $\bm{U}$ denotes a set of latent confounders. The first two are identifiable, while the third is not. Linear and nonlinear structural equations are applied with SNR = 2.

We find that CBL fares well in all settings. Constraint-based methods show less power to detect edges when present in this experiment, especially in nonlinear systems, while score-based methods incur higher error rates when edges are absent. We also observe that the constraint-based procedure requires considerable tuning---we had to experiment with a five-dimensional grid of decision thresholds to get reasonable results---and is by far the slowest to execute, taking about five times longer than CBL even with the random subset approach.

%Oddly, performance degrades with $n$ for the constraint-based method in this setting, likely a byproduct of the algorithm's unconventional decision procedure for inferring the null hypothesis of conditional independence from large $p$-values. The score-based alternative does poorly here as well, falsely inferring a causal relationship between foreground variables in about three quarters of trials. Score-based methods do better in scenario (a), attaining 87\% power at $n=4000$. By contrast, CBL attains just 70\% power in the same test, but incurs far fewer errors of commission in settings (b) and (c). 

%\subsection{Larger Subgraphs}
\paragraph{Larger subgraphs.} We benchmark against two popular causal discovery algorithms: really fast causal inference (RFCI), a constraint-based method proposed by \citet{Colombo2012} as a more scalable version of the original FCI algorithm \citep{Spirtes2000}; and greedy equivalence search (GES), a score-based alternative due to \citet{Meek1997} and \citet{Chickering2003}. Both algorithms can be computed with background information to encode our partial ordering assumption, and restricted to focus on the subgraph $\mathcal{G}_X$. Despite its name, RFCI struggles to converge in reasonable time ($< 24$ hours) when $n=1000$ and $d_Z$ is on the order of 100, so we limit comparisons here to smaller datasets and run fewer replications for this method (5) than we do for GES (20). This illustrates how the assumption of extreme sparsity is necessary for RFCI to work in practice.

\begin{figure}[t!]
    \centering
    \includegraphics[width=\columnwidth]{multiv2.pdf}
    %\vspace{-5mm}
    \caption{Simulation results for our multivariate experiment, benchmarking against RFCI and GES. Whiskers represent standard errors.}
    %\vspace{-3mm}
    \label{fig:multiv}
\end{figure}

For this simulation, we draw random graphs with low (0.25) and high (0.75) sparsity, $d_Z \in \{50, 100\}$, and $d_X = 6$, and simulate datasets of varying sample size from each. Relationships are linear throughout, with RFCI using partial correlation tests for conditional independence and GES scoring edges according to BIC. Accuracy is measured with respect to all pairwise relationships for which a decision is reached. We find that CBL is more accurate on average in nearly all settings, with especially strong results in the high-sparsity, high-dimensionality regime. However, our method can be less stable than GES, as illustrated by the greater variance of results, particularly in dense networks where CBL outputs a relatively large number of $\texttt{NA}$s. 

\begin{figure}[t]
    \centering
    \includegraphics[width=\columnwidth]{ate.pdf}
    %\vspace{-5mm}
    \caption{Average treatment effects estimated by combining CBL with three different algorithms at varying SNRs.}
    %\vspace{-3mm}
    \label{fig:ate}
\end{figure}

\paragraph{Causal effects.} Since our method identifies admissible sets for all detected edges, we may estimate the average treatment effect (ATE) via backdoor adjustment. For this experiment, we simulate data from a partially linear model as originally parametrized by \citet{robinson_partially_linear}:
\begin{equation*}
    \begin{aligned}
    X &= f(\bm{Z}) + \epsilon_X, \\
    Y &= \beta X + g(\bm{Z}) + \epsilon_Y,
    \end{aligned} \quad 
    \begin{aligned}
    &\mathbb{E}[\epsilon_X~|~\bm{Z}] = 0, \\
    &\mathbb{E}[\epsilon_Y~|~\bm{Z}, X] = 0,
    \end{aligned}
\end{equation*}
with $X \in \{0, 1\}$ and $Y \in \mathbb{R}$. The goal is to estimate $\beta$, which corresponds to the ATE. We run our pipeline with three different estimators: double machine learning (DML) \citep{Chernozhukov2018}, inverse propensity weighting (IPW) \citep{rosenbaum_1983}, and targeted maximum likelihood estimation (TMLE) \citep{tmle2011}. For all three methods, models are fit with gradient boosting and parameters estimated via cross-fitting to avoid regularization bias. We simulate 1000 datasets with $\beta = 1, d_Z = 100, n = 10000$, and $\text{SNR} \in \{\sfrac{1}{2}, 1, 2\}$. We find that all three methods provide consistent ATE estimates, with TMLE generally performing best in terms of bias and variance (see Fig.~\ref{fig:ate}). This illustrates how CBL can be combined with existing algorithms to go beyond causal discovery and into causal inference.
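
As a brief sketch of why $\beta$ is the estimand, and of how residual-based estimators can recover it, write $m(\bm{Z}) := \mathbb{E}[X~|~\bm{Z}]$ and $\ell(\bm{Z}) := \mathbb{E}[Y~|~\bm{Z}]$. This shorthand is introduced here for illustration only, and the exact estimating equations used by each method are not specified in this section. Since $\mathbb{E}[\epsilon_X~|~\bm{Z}] = 0$ and $\mathbb{E}[\epsilon_Y~|~\bm{Z}, X] = 0$, we have
\begin{align*}
    m(\bm{Z}) &= f(\bm{Z}), \qquad \ell(\bm{Z}) = \beta m(\bm{Z}) + g(\bm{Z}), \\
    Y - \ell(\bm{Z}) &= \beta \big(X - m(\bm{Z})\big) + \epsilon_Y, \\
    \mathbb{E}[Y~|~\bm{Z}, X = 1] - \mathbb{E}[Y~|~\bm{Z}, X = 0] &= \beta.
\end{align*}
The second line shows that regressing outcome residuals on treatment residuals targets $\beta$. The third shows that the conditional contrast is constant in $\bm{Z}$, so averaging it over $\bm{Z}$ yields the ATE whenever $\bm{Z}$ is a valid adjustment set.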

%Let $Y(1), Y(0)$ denote potential outcome variables observed under the interventions $do(X=1)$ and $do(X=0)$, respectively. Then the average treatment effect (ATE) is given by $\beta = \mathbb{E}[Y(1) - Y(0)]$, and the conditional average treatment effect (CATE) is $\beta f(\bm{z}) = \mathbb{E}[Y(1) - Y(0)~|~\bm{z}]$. We estimate both quantities using a double machine learning technique \citep{Chernozhukov2018}, with $f$ and $g$ learned via gradient boosting. For comparison, we benchmark against two leading methods for CATE estimation, generalized random forests (GRF) \citep{Athey2019} and Bayesian causal forests (BCF) \citep{hahn_bart}. Both naturally provide ATE estimates by integrating over $\bm{Z}$. 


\begin{figure}[t]
    \centering
    \includegraphics[width=.49\columnwidth]{subnetwork.pdf}
    \caption{Estimated phosphocholine subnetwork in \textit{S. cerevisiae}. Nodes are genes; edges denote ancestral relations.}
    %\vspace{-3mm}
    \label{fig:subg}
\end{figure}

%\subsection{Biological Data}
\paragraph{Biological data.} As a final example, we consider regulatory mechanisms in \textit{Saccharomyces cerevisiae}. Our background variables are single nucleotide polymorphism (SNP) markers, with transcriptomic profiles serving as foreground variables. Such setups are common in expression quantitative trait loci (eQTL) studies, where the goal is to identify genetic sources of variation in mRNA expression. Our dataset spans 112 $\text{F}_1$ segregants from a cross between two parental strains, BY4716 and the wild isolate RM11-1a \citep{Brem2005}. This dataset has been analyzed by several groups, who have identified numerous regulatory mechanisms through a combination of statistical and experimental methods \citep{Brem2005b, storey2005, Chen2007}. 

We focus on the six genes that comprise the phosphocholine subnetwork, which regulates metabolic processes. The full set of background variables includes 3244 SNP markers, covering over 99\% of the genome. We examine cis-eQTL candidates for each pair of genes (here defined as markers within 5 kilobases of either gene on the same chromosome), meaning $d_Z$ is usually on the order of 500. We use lasso for feature selection. Results are visualized in Fig.~\ref{fig:subg}, where edges denote relations of ancestry rather than direct causation. These findings corroborate those of \citet{Chen2019}, who recently examined regulatory mechanisms in yeast and inferred a phosphocholine subgraph that includes each ancestral relationship shown in Fig.~\ref{fig:subg}. Our method is conservative by comparison, perhaps due to the acyclicity assumption. \citet{Chen2019} infer several cycles (e.g., between ITR1 and MHO1) where CBL withholds judgment.

%(see Appx.~\ref{app:exp} for details)





%\subsection{Biological Example}
%As a real data example, we consider genotype-phenotype relationships in \textit{Mus Musculus}. \citet{Leduc2012} identify 5 expression quantitative trait loci (eQTLs) that regulate the cholesterol pathway in mice, represented by 9 relevant genes. We expand their dataset by adding 50 independent Boolean variables, simulating the effect of noisy single nucleotide polymorphisms (SNPs) from which eQTLs must be selected. The goal is to recover known features of the cholesterol gene regulatory network.

\section{Discussion}\label{sec:discussion}

We have proposed a novel method for learning ancestral relationships in downstream subgraphs based on the confounder blanket principle, which advises against combinatorial search for conditioning sets in cases where scale-free sparsity cannot be safely assumed. Our CBL algorithm is provably sound and lazy oracle-complete. Our sample version controls errors of commission with high probability and compares favorably to constraint- and score-based alternatives in a range of trials. In addition to accurately learning ancestral relationships, CBL identifies valid adjustment sets for causal effect estimation.

Completeness in causal discovery has traditionally been defined with respect to a classical independence oracle. In the context of statistical inference, this idealization serves a clear purpose, since there exists no uniformly valid test of conditional independence \citep{robins_consistency, shah2020}. Yet if our goal is simply to avoid the messiness of probabilistic reasoning in finite samples, then such oracles may overshoot the mark, for not only are they \textit{omniscient} about conditional independence relations---they are also \textit{omnipotent} with respect to computational complexity, able to scan through arbitrary subsets at no cost. We believe there are theoretical and practical advantages to decoupling these superpowers. Our lazy oracle is one example of how this may look, but others are also worth exploring.

We note several limitations of our method. First, CBL will struggle in the presence of weak edges. For instance, if the true graph is $\bm{Z} \rightarrow X \rightarrow Y$ and $I(X;Y) \gg I(X;\bm{Z})$, then conditioning on $Y$ in finite samples could deactivate some path(s) from $\bm{Z}$ to $X$, leading to the erroneous inference $Y \rightarrow X$. 
%Similarly, if the true structure is $\bm{Z} \rightarrow Y \leftarrow X$ and $I(X;Y) \gg I(Y;\bm{Z})$, then conditioning on $Y$ in finite samples may fail to activate the path(s) from $\bm{Z}$ to $X$. 
We observe that weak edges pose problems for all causal discovery procedures. Indeed, one motivation for taking an inclusive approach to background variables is the hope that a sufficiently large confounder blanket should include at least some strong edges that can be exploited to learn structural information about $\mathcal{G}_X$.  
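
The weak-edge example can be made slightly more precise with the chain rule for mutual information. We offer this as an illustrative sketch at the population level, not as a finite-sample guarantee. Under $\bm{Z} \rightarrow X \rightarrow Y$, the Markov property gives $I(\bm{Z};Y~|~X) = 0$, so expanding $I(\bm{Z};X,Y)$ in two ways yields
\begin{equation*}
    \begin{aligned}
    I(\bm{Z};X,Y) &= I(\bm{Z};X) + I(\bm{Z};Y~|~X) = I(\bm{Z};X), \\
    I(\bm{Z};X,Y) &= I(\bm{Z};Y) + I(\bm{Z};X~|~Y),
    \end{aligned}
\end{equation*}
and hence
\begin{equation*}
    I(\bm{Z};X~|~Y) = I(\bm{Z};X) - I(\bm{Z};Y) \leq I(\bm{Z};X).
\end{equation*}
Conditioning on $Y$ can therefore only weaken the dependence between $\bm{Z}$ and $X$ in this graph. If, in addition, $X$ is discrete (an assumption made here for simplicity), then $I(\bm{Z};X~|~Y) \leq H(X~|~Y) = H(X) - I(X;Y)$, so the stronger the edge $X \rightarrow Y$, the less residual signal remains between $\bm{Z}$ and $X$ given $Y$. In finite samples, this residual dependence may be indistinguishable from zero.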

%CBL relies on the assumption of faithfulness, which has been challenged by numerous authors \citep{Cartwright2001, Steel2006, zhang2008_unfaithful, Andersen2013, Uhler2013}. Several weaker variants have been proposed, including SGS-minimality \citep{Spirtes2000}, P-minimality \citep{Pearl2009}, frugality \citep{Forster2018}, and 2-adjacency faithfulness \citep{marx2021}. One direction for future work is to extend CBL under these relaxed assumptions.

CBL relies on the faithfulness assumption, which has been challenged by numerous authors \citep{zhang2008_unfaithful, Andersen2013, Uhler2013}. Several weaker variants have been proposed, including SGS-minimality \citep{Spirtes2000}, P-minimality \citep{Pearl2009}, and 2-adjacency faithfulness \citep{marx2021}. One direction for future work is to extend CBL under these relaxed assumptions.

The current implementation of CBL is order-dependent, in that estimated subgraphs for the same dataset may vary if columns are reordered. This can be addressed using methods previously devised for constraint-based causal discovery \citep{colombo2014}.

%By relying on regression techniques to estimate Markov blankets, we generally substitute conditional \textit{expectations} for conditional \textit{probabilities}. Of course, conditional independence in expectation is necessary but not sufficient for conditional independence \textit{tout court}. If it is important to capture potential dependencies between higher moments, alternative feature selection methods may be required, e.g. parametric solutions based on weighted least squares or nonparametric functional regression techniques. 


\begin{acknowledgements} 
This work was supported by ONR grant 62909-19-1-2096. We thank Joshua Loftus for helpful comments on an earlier draft of this manuscript, and Michael Barnes for fruitful discussions on eQTL analysis.
\end{acknowledgements}

\bibliography{biblio}

%\appendix
%\input{appendix_A.tex}
%\input{appendix_B.tex}
%\input{appendix_C.tex}
%\input{appendix_D.tex}

\end{document}
