\section{Introduction}
Estimating causal effects from observational data is challenging when treatments are not randomly assigned. While the task is feasible under some structural assumptions \citep{rubin2005, shpitser2008, pearl2009causality}, most methods require access to data on potential confounders. Such access cannot generally be guaranteed, since confounding variables may be unknown or difficult to measure. One common strategy for identifying causal effects under unobserved confounding relies on instrumental variables (IVs) \citep{wright1928, bt_iv, angrist1996}, which have a direct effect on the treatment but only an indirect effect on outcomes. For instance, single nucleotide polymorphisms (SNPs) often serve as IVs in genetic epidemiology, where they may be used to investigate the impact of phenotypes (e.g., cholesterol levels) on health outcomes (e.g., cancer) in Mendelian randomization studies \citep{smith2004, didelez2007, lawlor2008}.

\begin{figure}
    \centering
    \begin{tikzpicture} [node distance=10mm,>=stealth',sh/.style={shade}]
    \node [events] (U) [sh] {$U$} ;
    \node [events, below left = of U ] (X) {$X$};
    \node [events, below right = of U ] (Y) {$Y$};
    \node [events,  below   = of X ] (z2) {$Z_{j-1}$};
    \node [events,  left = of z2 ] (z1) {$Z_1$};
    \node [events,  below  = of Y ] (z3) {$Z_j$};
    \node [events,  right = of z3 ] (z4) {$Z_{d_{\Vector{Z}}}$};
    \draw [->] (U) to (X);
    \draw [->] (U) to (Y);
    %\draw [->] (U) to [out=-150, in=60]  (X);
    %\draw [->] (U) to [out=-30, in=120]  (Y);
    \draw [->] (X) to  (Y);
    \draw [->] (z1) to  (X);
    \draw [->] (z2) to  (X);
    \draw [->] (z3) to  (X);
    \draw [->] (z4) to  (X);
    \draw [dashed, ->] (z3) to (Y);
    \draw [dashed , ->] (z4) to  (Y);
    \draw [white](z2) to node[midway, black] {. . .} (z1);
    %\draw [white](z3) to node[midway, black] {. . .} (z2);
    \draw [white](z4) to node[midway, black] {. . .} (z3);
    \end{tikzpicture}
    \caption{Causal diagram with treatment $X$, outcome $Y$, unobserved confounder $U$ (shaded), and candidate instruments $Z_1, \dots, Z_{d_{\Vector Z}}$. Dashed edges indicate possible violations of the exclusion criterion. Edges among $\Vector Z$ are allowed but omitted for simplicity.}\vspace{-3mm}
    \label{fig:dag}
\end{figure}

The IV model relies on three core conditions, formally defined in Sect.~\ref{sec:problem}. 
Informally, we may describe IVs as variables that are (A1) \emph{relevant}, i.e. associated with the treatment; (A2) \emph{unconfounded}, i.e. independent of common causes of treatment and outcome; and (A3) \emph{exclusive}, i.e. affecting outcomes only through the treatment. 
Under some restrictions on structural equations, IVs can be used to recover causal effects despite the presence of unobserved confounding. 
Popular IV methods include two-stage least squares \citep{angrist1995}, as well as nonparametric extensions based on conditional moment restrictions \citep{newey2003, newey2013, bennet2019} and more recent works exploiting kernel regression \citep{singh2019kernel, muandet2020, zhang2023} and neural networks \citep{hartford2017deep, xu2021learning, sorowit2022}. 

Of the three core conditions that characterize the IV setup, only (A1) can be immediately evaluated via observables.
%only (A1) is empirically falsifiable. 
(A2) fails if the latent confounder between treatment and outcome is also a parent of some proposed IV(s).
To continue with the Mendelian randomization example, this issue can arise when nearby variants are correlated, a phenomenon known as \emph{linkage disequilibrium} \citep{Reich2001}. (A3) fails if the proposed IV is a direct cause of the outcome (see Fig.~\ref{fig:dag}). 
This can happen with complex traits in genetics, where one gene affects multiple seemingly unrelated systems through a process called \emph{horizontal pleiotropy} \citep{Solovieff2013}. 
If IV methods are na{\"i}vely applied in either case, resulting inferences can be severely biased \citep{VanderWeele2014}.

Since valid IVs may be impossible to identify \emph{a priori}, several authors in recent years have proposed methods to estimate causal effects given just a set of candidate IVs (see Sect.~\ref{sec:related}). 
Details vary, but the goal is almost always to recover point estimates for the average treatment effect (ATE), possibly with associated confidence or credible intervals. 
We set a strictly more general target, relaxing (A3) to recover nontrivial bounds on this parameter, i.e. to \textit{partially} identify the ATE. 
There is a long tradition of analytic and Bayesian methods for partial identification in IV models \citep{Manski1990,chickering96,Balke1997}, as well as more recent works that exploit the flexibility of stochastic gradient descent \citep{kallus_zhou_2020, kilbertus2020class, hu2021}. 
Generally, a set of valid IVs is presumed---although some authors have considered the case where a single instrument is allowed to have a small effect on the outcome \citep{ramsahai2012, Conley2012, silva2016}.

We propose a novel procedure for bounding causal effects in settings where the exclusion criterion (A3) may not hold. 
Our method takes a set of \emph{leaky instruments}, which are permitted to violate (A3) to some limited degree, and uses them to minimize confounding effects on the causal pathway of interest. 
Focusing on linear structural equation models (SEMs), we derive partial identifiability conditions for the ATE with access to leaky instruments, and use them to formulate a convex optimization objective. 
Resulting bounds are provably \textit{sharp}---that is, they cannot be improved without further assumptions---and practically useful, providing causal information in many settings where classical methods fail. 
Finally, we propose a statistical test for exclusion and implement a generic bootstrapping procedure with coverage guarantees for estimated bounds.

The rest of this paper is structured as follows. 
We introduce the leaky IV model in Sect.~\ref{sec:problem}. 
We present formal results in Sect.~\ref{sec:theory} and experimental results in Sect.~\ref{sec:experiments}. 
Following a review of related work in Sect.~\ref{sec:related}, we discuss limitations and generalizations of our method in Sect.~\ref{sec:discussion}. 
We conclude in Sect.~\ref{sec:conclusion} with a summary and future directions. 

\begin{figure}
    \centering
    \begin{tikzpicture}
    [node distance=12mm,>=stealth',sh/.style={shade}]
    \node [events] (U) [white] {U} ;
    \node [events, below left = of U ] (X) {$X$};
    \node [events, below right = of U ] (Y) {$Y$};
    \node [events,  left = of U ] (eX) [sh] {$\epsilon_x$};
    \node [events,  right = of U ] (ey) [sh] {$\epsilon_y$};
    \node [events,  below right  = of X ] (Z) {$\Vector{Z}$};
    %\draw [dashed,<->] (eX) to [out=30, in=150] node[midway, above] {$\rho$} (ey);
    \draw [<->] (eX) to [out=30, in=150] node[midway, above] {$\rho$} (ey);
    \draw [->] (eX) to node[midway, left] { } (X);
    \draw [->] (ey) to node[midway, right] { } (Y);
    \draw [->] (X) to node[midway, above] {$\theta$} (Y);
    \draw [->] (Z) to node[midway, left] {$\Vector{\beta}$} (X);
    \draw [dashed, ->] (Z) to node[midway, right] {$\Vector{\gamma}$} (Y);
    \end{tikzpicture}
    \caption{Causal diagram of the SEM described by Eqs.~\ref{eq:scmx}--\ref{eq:scmSigma}. Edge weights correspond to linear coefficients, 
    while unobserved confounding effects are represented by the bidirected edge connecting $\epsilon_x$ and $\epsilon_y$. 
        % while confounding effects are represented by the bidirected edge connecting $\epsilon_x$ and $\epsilon_y$. 
    % The dashed edge from $\Vector{Z}$ to $Y$ once again denotes possible violations of (A3).}\vspace{-3mm}
        The dashed edge from $\Vector{Z}$ to $Y$ denotes possible violations of (A3).}\vspace{-3mm}
        \label{fig:sem}
\end{figure}

\section{Problem Setup}\label{sec:problem}

\paragraph{Notation.}
%\subsection{Notation}\label{sec:notation}
% symbol for scalar: $\Scalar{a}$\\
% symbol for vector: $\Vector{v}$\\
% symbol for matrix: $\Matrix{M}$\\
We denote individual variables with capital italic letters (e.g., $X$) and bundled sets of variables in boldface capital italics (e.g., $\Vector{Z} = \{Z_j\}_{j=1}^{d_{\Vector{Z}}}$). 
We use square brackets to indicate set enumeration, e.g. $[d_{\Vector{Z}}] = \{1, \dots, d_{\Vector{Z}}\}$.
Parameters are symbolized as Greek letters, with boldface for vectors (e.g., $\Vector{\beta}$) and boldface capitals for matrices (e.g., $\SSigma$). 

Standard notation for (co)variances can sometimes be confusing.  
Here we use the capital $\SSigma$ for any such quadratic expectation, regardless of whether it is a scalar, vector, or matrix---the subscripts will contain all the necessary information as to its dimensions.  
For instance, we write $\SSigma_{xy}$ for $\text{Cov}(X,Y)$ (as opposed to $\sigma_{xy}$) and $\SSigma_{xx}$ for $\text{Var}(X)$ (as opposed to $\sigma_{x}^2$). 
As a convenient byproduct, the notation generalizes more naturally to vector-valued variables. 
%$X$ and $Y$.  
%Additionally, we use bold text for bundled variables (e.g., vectors and matrices). 
%We use shorthand for conditional variances and covariances, which in our linear setting can be written:
%\begin{align*}
%    \SSigma_{xx \mid \bm{z}} &:= \text{Var}(X \mid \bm{Z}) = \SSigma_{xx} - \SSigma_{x\bm{z}} \SSigma_{\bm{zz}}^{-1} \Sigma_{\bm{z}x}\\
%    \SSigma_{xy \mid \bm{z}} &:= \text{Cov}(X, Y \mid \bm Z) = \SSigma_{xy} - \SSigma_{x\bm{z}} \SSigma_{\bm{zz}}^{-1} \Sigma_{\bm{z}y}.
%\end{align*}

\paragraph{The Leaky IV Setting.}
Consider a linear SEM with treatment $X \in \mathbb{R}$, outcome $Y \in \mathbb{R}$, and a set of candidate IVs $\Vector{Z} \in \mathbb{R}^{d_{\Vector{Z}}}$. 
Assume that all variables have mean 0 and finite variance.
Data are generated according to the following process (see Fig.~\ref{fig:sem}): 
% (see Fig.~\ref{fig:sem}): % (Figs.~\ref{fig:dag} and \ref{fig:sem})
\begin{align}
    X &= \Vector{\beta}\DotProd\Vector{Z} + \epsilon_x \label{eq:scmx} \\
    Y &= \Vector{\gamma}\DotProd\Vector{Z} + \theta X + \epsilon_y \label{eq:scmy} \\
    \SSigma_{\epsilon \epsilon} &= \begin{bmatrix}
        \eta_x^2&\rho\eta_x\eta_y\\\rho\eta_x\eta_y&\eta_y^2
    \end{bmatrix},\label{eq:scmSigma}
\end{align} % eq:scmx,eq:scmy,eq:scmSigma
where $\theta \in \mathbb{R}$ and $\Vector{\beta}, \Vector{\gamma} \in \mathbb{R}^{d_{\Vector{Z}}}$ are linear weights; 
and $\epsilon_x, \epsilon_y \in \mathbb{R}$ are residuals with mean $0$, standard deviations $\eta_x, \eta_y \geq 0$, and correlation $\rho \in [-1, 1]$.
%covariance $\SSigma_{\Vector{\epsilon}}$, which involves the (\textit{a priori} unknown) correlation coefficient $\rho \in [-1, 1]$. 
This latter parameter $\rho$ quantifies the magnitude and direction of unobserved confounding. 
% This latter parameter quantifies the magnitude and direction of unobserved confounding. 
%We make no assumptions about the distribution of $\Vector{Z}$ (besides finite second moments). 
To aid interpretation, we assume all $Z_j \in \Vector{Z}$ are on roughly the same scale (e.g., standardized to unit variance). We make no further assumptions about the distribution of $\Vector{Z}$.
Our goal is to bound $\theta$, which denotes the average treatment effect (ATE) of $X$ on $Y$. 

In the classical nonparametric IV setting, we have a set of unobserved confounders $\Vector{U} \in \mathbb{R}^{d_{\Vector{U}}}$ with direct effects on both $X$ and $Y$. Then $\Vector{Z}$ is a set of \emph{valid instruments} if and only if the following conditions are satisfied:\footnote{The ``exogeneity'' assumption in econometrics is sometimes equated with (A2), and sometimes with the conjunction of (A2) and (A3) (see, e.g., \citealp[Ch.~15]{wooldridge2009introductory}). We do not use the term, so as to avoid confusion.}
%The assumptions of the classical IV model may be formalized as follows. 
%Let $\Vector{U} \in \mathbb{R}^{d_{\Vector{U}}}$ be a set of unobserved confounders with direct effects on both $X$ and $Y$, such that $\epsilon_x \indep \epsilon_y \mid \Vector{U}$. 
\begin{itemize}[noitemsep, itemindent=2em]
    \item[(A1)] \emph{Relevance:} $\Vector{Z} \dep X$
    \item[(A2)] \emph{No confounding:} $\Vector{Z} \indep \Vector{U}$
    \item[(A3)] \emph{Exclusion criterion:} $\Vector{Z} \indep Y~|~\{X, \Vector{U}\}$.
\end{itemize}
Adapting these assumptions to a linear SEM, we posit that $\Vector U$ is a parent of both noise variables, with $\epsilon_x \indep \epsilon_y \mid \Vector{U}$, and equate (conditional) independence with (conditional) covariance of zero. Under these conditions, we may compute treatment effects via two-stage least squares (2SLS) \citep{bt_iv}.
For this procedure, we fit Eq.~\ref{eq:scmx} with ordinary least squares (OLS) and substitute fitted values from this model for $X$ in Eq.~\ref{eq:scmy}, which is in turn fit via OLS. 
The resulting $\hat{\theta}^{\text{2SLS}}$ is our ATE estimate. 
Note that (A3) implies that $\lVert \Vector{\gamma} \rVert = 0$, since in this case each $Z_j \in \Vector{Z}$ receives zero weight in Eq.~\ref{eq:scmy}. 
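As a sketch of why 2SLS targets $\theta$ at the population level, assume (as our reading of (A2) in the linear setting) that $\Vector{Z}$ is uncorrelated with both $\epsilon_x$ and $\epsilon_y$, and that $\SSigma_{\Vector{z}\Vector{z}}$ is invertible. Writing $\theta^{\text{2SLS}}$ for the population coefficient from regressing $Y$ on the first-stage fit $\Vector{\beta}\DotProd\Vector{Z}$, Eqs.~\ref{eq:scmx} and \ref{eq:scmy} give $\SSigma_{\Vector{z}y} = \SSigma_{\Vector{z}\Vector{z}} \DotProd (\Vector{\gamma} + \theta\Vector{\beta})$, and therefore
\begin{align*}
    \theta^{\text{2SLS}} = \frac{\Vector{\beta} \DotProd \SSigma_{\Vector{z}y}}{\Vector{\beta} \DotProd \SSigma_{\Vector{z}\Vector{z}} \DotProd \Vector{\beta}} = \theta + \frac{\Vector{\beta} \DotProd \SSigma_{\Vector{z}\Vector{z}} \DotProd \Vector{\gamma}}{\Vector{\beta} \DotProd \SSigma_{\Vector{z}\Vector{z}} \DotProd \Vector{\beta}}.
\end{align*}
The second term vanishes when $\Vector{\gamma} = \Vector{0}$, but is generally nonzero otherwise, in which case it acts as a bias driven by the direct effect of $\Vector{Z}$ on $Y$.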
%Note that (A3) implies that any norm of $\Vector{\gamma}$ is zero since in this case each $Z_j \in \Vector{Z}$ receives zero weight in Eq.~\ref{eq:scmy}. % what about this?
% Note, however, that (A3) implies that $\lVert \Vector{\gamma} \rVert_0 = 0$, i.e. that each $Z_j \in \Vector{Z}$ receives zero weight in Eq.~\ref{eq:scmy}. 

%When classical assumptions fail, alternative methods are needed. 
We relax the exclusion criterion and consider two modified variants using scalar or vector-valued thresholds.
\begin{itemize}[noitemsep, itemindent=2em]
    \item[(A$3'_s$)] \emph{Scalar $\tau$-exclusion:} $\lVert \Vector{\gamma} \rVert_p \leq \tau$
    %\item[(A$3'_v$)] \emph{Vector $\tau$-exclusion:} $\lVert \Vector{\gamma}/
    %\Vector{\tau} \rVert_p \leq 1$.%
    \item[(A$3'_v$)] \emph{Vector $\tau$-exclusion:} $\forall j \in [d_{\Vector{Z}}]: |\gamma_j | \leq \tau_j$.
\end{itemize}
That is, we allow $\Vector{Z}$ to have some direct effect on $Y$ but restrict this influence either by placing an upper bound on the $L_p$-norm of the $\Vector{\gamma}$ coefficients (scalar-valued $\tau$) or by placing separate thresholds on the magnitude of each individual coefficient (vector-valued $\Vector{\tau}$). 
%on a rescaled version $\tilde{\Vector{\gamma}}=\Vector{\gamma}/\Vector{\tau}$ coefficient (vector-valued $\Vector{\tau}$). 
We derive sharp ATE bounds for both cases, as well as a closed-form solution under (A$3'_s$) when $p=2$.
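As a purely illustrative example with hypothetical values, suppose $d_{\Vector{Z}} = 2$ and $\Vector{\gamma} = (0.3, -0.4)$. Then
\begin{align*}
    \lVert \Vector{\gamma} \rVert_1 = 0.7, \quad \lVert \Vector{\gamma} \rVert_2 = 0.5, \quad \lVert \Vector{\gamma} \rVert_\infty = 0.4,
\end{align*}
so scalar $\tau$-exclusion with $p=2$ holds if and only if $\tau \geq 0.5$, whereas vector $\tau$-exclusion holds if and only if $\tau_1 \geq 0.3$ and $\tau_2 \geq 0.4$.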
%Though our experiments focus on cases where $p \in \{1, 2\}$, our Thm. \ref{thm:ate_bounds} provides exact solutions for any $p \geq 1$. 
%We also consider more general conditions on $\Vector{\gamma}$, including unique restrictions on each $\gamma_j \in \Vector{\gamma}$, and derive an efficient algorithm for approximate solutions under such constraints.  

We call variables that satisfy (A1), (A2), and either form of $\tau$-exclusion \emph{leaky instruments}. 
These features are technically observed confounders (at least those with nonzero $\gamma$ coefficients). 
Were it not for the unobserved confounding induced by $\rho$, causal effects could be calculated by integrating over $\Vector{Z}$, as in the backdoor adjustment \citep[Ch.~3.3]{pearl2009causality}. 
Unfortunately, this option is unavailable when $\rho \neq 0$. 
By exploiting known leakage threshold(s), however, we show how to recover sharp bounds on the ATE.

Unlike other methods designed to accommodate potential violations of the exclusion criterion (see Sect.~\ref{sec:related}), we do not assume that some proportion of candidate IVs are valid, or that biases introduced by direct links from $\Vector{Z}$ to $Y$ cancel out. 
On the contrary, we explicitly allow for a dense set of nonzero $\Vector{\gamma}$ weights, provided they satisfy some form of $\tau$-exclusion. 
As our experiments below demonstrate, this method naturally accommodates sparse $\Vector{\gamma}$ vectors without presuming them upfront.



%We may now informally describe our optimization objective: to find the minimal and maximal values of the ATE such that structural assumptions are satisfied, and the direct signal from $\bm{Z}$ to $Y$ is bounded. Optimization is with respect to the degree of unobserved confounding, as we explain in the sequel.

\section{Theory}
\label{sec:theory}
In this section, we show how to partially identify the ATE under bounded violations of the exclusion criterion and propose  methods for statistical inference.

\subsection{Scalar $\tau$-exclusion}
\label{sec:scalar_tau}
We begin with a scalar threshold on information leakage from $\Vector{Z}$ to $Y$.
As a first pass, we may formalize our objective as follows:
\begin{align*}
    \underset{\Vector{\beta}, \Vector{\gamma}, \theta, \eta_x, \eta_y, \rho} {\text{min/max}} \quad &\theta\nonumber\\
    \text{s.t.} \quad & \SSigma_{\mathcal{M}} = \SSigma,\\ 
    &\eta_x \geq 0, \eta_y \geq 0, -1 \leq \rho \leq 1, \lVert \Vector{\gamma} \rVert_p \leq \tau,
\end{align*}
where $\SSigma$ is the observational covariance matrix of $\{X, Y, \Vector{Z}\}$ and $\SSigma_{\mathcal{M}}$ is the model covariance matrix implied by Eqs.~\ref{eq:scmx}, \ref{eq:scmy} and \ref{eq:scmSigma}.
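For concreteness, the constraint $\SSigma_{\mathcal{M}} = \SSigma$ can be written out block by block. As a sketch, assuming that $\Vector{Z}$ is uncorrelated with $(\epsilon_x, \epsilon_y)$ (our reading of (A2) in the linear setting), the structural equations imply:
\begin{align*}
    \SSigma_{\Vector{z}x} &= \SSigma_{\Vector{z}\Vector{z}} \DotProd \Vector{\beta}\\
    \SSigma_{\Vector{z}y} &= \SSigma_{\Vector{z}\Vector{z}} \DotProd (\Vector{\gamma} + \theta \Vector{\beta})\\
    \SSigma_{xx} &= \Vector{\beta} \DotProd \SSigma_{\Vector{z}\Vector{z}} \DotProd \Vector{\beta} + \eta_x^2\\
    \SSigma_{xy} &= \Vector{\beta} \DotProd \SSigma_{\Vector{z}\Vector{z}} \DotProd (\Vector{\gamma} + \theta \Vector{\beta}) + \theta \eta_x^2 + \rho \eta_x \eta_y\\
    \SSigma_{yy} &= (\Vector{\gamma} + \theta \Vector{\beta}) \DotProd \SSigma_{\Vector{z}\Vector{z}} \DotProd (\Vector{\gamma} + \theta \Vector{\beta}) + \theta^2 \eta_x^2 + 2 \theta \rho \eta_x \eta_y + \eta_y^2,
\end{align*}
with the $\SSigma_{\Vector{z}\Vector{z}}$ block left unrestricted. This amounts to $2 d_{\Vector{Z}} + 3$ moment equations in $2 d_{\Vector{Z}} + 4$ unknowns, which suggests that the equality constraints alone leave a one-dimensional family of candidate solutions.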
% where $\SSigma$ is the observational covariance matrix of $\{X, Y, \bm{Z}\}$ and $\Sigma_M$ is the model covariance matrix implied by Eqs.~\ref{eq:scmx}, \ref{eq:scmy} and \ref{eq:scmSigma}.

Though technically correct, this formulation is unnecessarily complex. It suggests a potentially high-dimensional constrained optimization problem that is not obviously amenable to polynomial programming techniques. 
To simplify it, we provisionally assume access to the population covariance matrix $\SSigma$. (We discuss methods for estimating these parameters in Sect.~\ref{sec:cvg}.) 
% (Methods for estimating these parameters are reviewed in Sect.~\ref{sec:cvg}.) 
This allows us to solve directly for $\Vector{\beta}$ and $\eta^2_x$: %
\begin{align*}
    \Vector{\beta} = \SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}x}, \quad \eta^2_x = \SSigma_{xx} - \Vector{\beta} \DotProd \SSigma_{x\Vector{z}}.
\end{align*}
% \begin{align*}
%     \Vector{\beta} = \SSigma_{x\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-1}, \quad \eta^2_x = \SSigma_{xx} - \Vector{\beta} \SSigma_{\Vector{z}x}.
% \end{align*}
% With these parameters fixed, all remaining coefficients are rendered deterministic functions of $\rho$, with some special care for the non-negativity constraint on $\eta_y$. 
With these parameters fixed, the remaining coefficients $\theta,\Vector{\gamma}$ are rendered deterministic functions of $\rho$, with some special care for the non-negativity constraint on $\eta_y$. 
To see this, it helps to define the scalars:
\begin{align*}
    \KappaXX &:=  \SSigma_{xx} - \SSigma_{x\Vector{z}} \DotProd \SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}x} = \eta_x^2\\
    \KappaXY &:=  \SSigma_{xy} - \SSigma_{x\Vector{z}} \DotProd \SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}y} \\
    % = \text{Cov}(X,Y|\Vector{Z}).
    \KappaYY &:=  \SSigma_{yy} - \SSigma_{y\Vector{z}} \DotProd \SSigma_{\Vector{z} \Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}y} 
    %= \text{Var}(Y|\Vector{Z}) \\
\end{align*}
% These terms correspond to the conditional variance of $Y$ given $\Vector{Z}$, and the conditional covariance of $X$ and $Y$ given $\Vector{Z}$, respectively. 
These terms correspond, respectively, to the conditional variance of $X$ given $\Vector{Z}$ ($\KappaXX$), 
the conditional covariance of $X$ and $Y$ given $\Vector{Z}$ ($\KappaXY$), 
and the conditional variance of $Y$ given $\Vector{Z}$ ($\KappaYY$). 
Thus, by the Cauchy--Schwarz inequality, $\KappaXX\KappaYY \geq \KappaXY^2$. 
%In the classic IV model, $\phi^2$ reduces to $\SSigma_{yy}$ and $\psi$ reduces to $\SSigma_{xy}$. 
% Note that in the classic IV model, $\KappaYY$ reduces to $\SSigma_{yy}$ and $\KappaXY$ reduces to $\SSigma_{xy}$. 
%The first quantity represents the total variance of $Y$ not explained by $\bm Z$, i.e. the variance induced by the treatment, the noise $\epsilon_y$, and the confounding between them. In the classic IV model, $\phi^2$ reduces to $\SSigma_{yy}$.
%The second quantity represents the total covariance of $X$ and $Y$ not explained by $\bm Z$, i.e. the covariance induced by the direct link from treatment to outcome as well as the latent confounding between them. In the classic IV model, $\psi$ reduces to $\SSigma_{xy}$. 
% Note that $\phi$, like the other conditional standard deviation terms $\eta_x, \eta_y$, is non-negative by definition. 
% Note that $\phi$, like the other conditional standard deviation terms $\eta_x, \eta_y$, is non-negative by definition. 
% Further, it satisfies the inequality $\KappaXX\KappaYY \geq \KappaXY^2$.
% Note that $\KappaXX\KappaYY \geq \KappaXY^2$.
% With these definitions at hand, 
% we may now characterize the relationship between $\rho$ and $\theta$. (See Appx.~\ref{appx:proofs} for all proofs.)
With these definitions in hand, 
we are now ready to characterize the relationship between $\rho$ and $\theta$. (See Appx.~\ref{appx:proofs} for all proofs.)
\begin{lemma}[\textit{ATE as a function of confounding}]
\label{lemma:theta}
    There is a bijective, strictly decreasing function $f: [-1, 1] \mapsto \mathbb{R}$ that maps values of the confounding coefficient $\rho$ to the ATE $\theta$:
    %\begin{align*}
    %    \theta = f(\rho) = \frac{b}{a} - \text{sgn}(\rho) \frac{\sqrt{(1 - 1 / \rho^2) (b^2 - ac)}}{a(1 - 1/\rho^2)}
   % \end{align*}
    %\begin{align*}%\label{eq:theta_formula}
    %    \theta = f(\rho) = \frac{\psi}{\eta^2_x} - \text{sgn}(\rho) \frac{\sqrt{(1 - 1 / \rho^2) (\psi^2 - \phi^2\eta^2_x)}}{-\eta^2_x(1 - 1/\rho^2)}.
    %\end{align*}
    %\begin{align*}
    %\begin{align*}
    %    \theta &= \KappaXX^{-1} \Big( \KappaXY - \sqrt{\KappaXX\KappaYY - \KappaXY^2} \tan\kern-1pt\big(\kern-1pt\arcsin(\rho)\big)\kern-1pt\Big)\\
    % &=: f(\rho).
    %\end{align*}
    \begin{align*}
        \theta = f(\rho) := \KappaXX^{-1} \Big( \KappaXY - \sqrt{\KappaXX\KappaYY - \KappaXY^2} \tan\kern-1pt\big(\kern-1pt\arcsin(\rho)\big)\kern-1pt\Big).
    \end{align*}
    %\begin{align*}
    %    \theta = f(\rho) = \frac{1}{\KappaXX} \Bigg(\KappaXY - \rho \sqrt{\frac{\KappaXX\KappaYY - \KappaXY^2}{1 - \rho^2}}\Bigg).
    %\end{align*}
\end{lemma}
  % \begin{align*}
  %       \theta = f(\rho) = \frac{\psi}{\eta^2_x} - \text{sgn}(\rho) \frac{\sqrt{(1 - 1 / \rho^2) (\psi^2 - \phi^2\eta^2_x)}}{-\eta^2_x(1 - 1/\rho^2)}.
  %   \end{align*}
This function takes the shape of a rotated sigmoid (see Fig.~\ref{fig:lemmas}A). 
Note that when $\rho = 0$, there is no unobserved confounding and $\theta$ can simply be estimated by OLS (if $\lVert \Vector{\gamma} \rVert = 0$) or backdoor adjustment on $\Vector{Z}$ (if $\lVert \Vector{\gamma} \rVert > 0$).
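As a brief sanity check on Lemma~\ref{lemma:theta} (a sketch only, assuming $\Vector{Z}$ is uncorrelated with the residuals and $|\rho| < 1$), note that after partialling out $\Vector{Z}$, the treatment reduces to $\epsilon_x$ and the outcome to $\theta \epsilon_x + \epsilon_y$. Hence
\begin{align*}
    \KappaXX &= \eta_x^2\\
    \KappaXY &= \theta \eta_x^2 + \rho \eta_x \eta_y\\
    \KappaYY &= \theta^2 \eta_x^2 + 2 \theta \rho \eta_x \eta_y + \eta_y^2,
\end{align*}
so that $\KappaXX\KappaYY - \KappaXY^2 = \eta_x^2 \eta_y^2 (1 - \rho^2)$. Since $\tan(\arcsin(\rho)) = \rho / \sqrt{1 - \rho^2}$, we obtain
\begin{align*}
    \sqrt{\KappaXX\KappaYY - \KappaXY^2} \, \tan\big(\arcsin(\rho)\big) = \rho \eta_x \eta_y, \qquad \KappaXX^{-1} \big( \KappaXY - \rho \eta_x \eta_y \big) = \theta,
\end{align*}
which agrees with the stated form of $f$.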

%Note that if we dropped the sgn function and allowed for alternative quadratic solutions---i.e., greater roots for positive confounding and lesser roots for negative confounding---we would violate the nonnegativity condition for $\eta_x$. See Appx. BLAH. 
\begin{figure}[t]
  \centering
  \includegraphics[width=0.95\columnwidth]{figures/lemmas.pdf}
  \vspace{-2mm}
  \caption{Example curves illustrating the relationships between parameters in the leaky IV model. 
  \textbf{(A)} A $\rho$-$\theta$ curve maps the relationship between latent confounding and causal effects. 
  \textbf{(B)} A $\theta$-$\lVert \Vector{\gamma} \rVert_2$ curve maps the relationship between causal effects and information leakage. 
  Shading represents 95\% confidence intervals estimated via the bootstrap.}
  \vspace{-2mm}
  \label{fig:lemmas}
\end{figure}

% Because strong confounding induces extreme values of $\theta$, more informative bounds on the ATE can be derived if further knowledge about the problem allows us to truncate the range of $\rho$. 
Because strong confounding induces extreme values of $\theta$, more informative bounds on the ATE can be derived if subject matter knowledge allows us to truncate the range of $\rho$. 
This is precisely what (A$3'_s$) achieves, although the exact form of this truncation depends on our choice of norm. 
To see how $\tau$-exclusion restricts $\rho$'s range, we must spell out the relationship between the ATE and leaky weights $\Vector{\gamma}$.
\begin{lemma}[\textit{Leakage as a function of ATE}]
\label{lemma:gamma}
    There is a surjective function $g_p: \mathbb{R} \mapsto \mathbb{R}_{\geq 0}$ that maps values of the ATE $\theta$ to the $L_p$ norm of the leakage weights $\Vector \gamma$:
    \begin{align*}%\label{eq:gamma_formula}
        \lVert \Vector{\gamma} \rVert_p = g_p(\theta) := \lVert \Vector{\alpha} - \theta \Vector{\beta} \rVert_p,
    \end{align*}
    where $\Vector{\alpha} := \SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}y}$ represents the expected weights of an OLS regression of $Y$ on $\Vector Z$. 
\end{lemma}
For the special case of $p=2$, the squared norm $g_2(\theta)^2$ is a convex quadratic in $\theta$ (see Fig.~\ref{fig:lemmas}B). 
Recall that the $L_p$ norm is convex for all $p \geq 1$ and strictly convex for $p \in (1, \infty)$.
Though everywhere differentiable for $p \in [2, \infty)$, the norm may be non-differentiable at countably many points for $p \in [1, 2)$. 
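The identity underlying Lemma~\ref{lemma:gamma} can be sketched in one line, assuming that $\Vector{Z}$ is uncorrelated with $\epsilon_x$ and $\epsilon_y$ and that $\SSigma_{\Vector{z}\Vector{z}}$ is invertible. Substituting Eq.~\ref{eq:scmx} into Eq.~\ref{eq:scmy} gives $Y = (\Vector{\gamma} + \theta \Vector{\beta}) \DotProd \Vector{Z} + \theta \epsilon_x + \epsilon_y$, and therefore
\begin{align*}
    \SSigma_{\Vector{z}y} = \SSigma_{\Vector{z}\Vector{z}} \DotProd (\Vector{\gamma} + \theta \Vector{\beta}) 
    \quad \Longrightarrow \quad 
    \Vector{\alpha} = \Vector{\gamma} + \theta \Vector{\beta} 
    \quad \Longrightarrow \quad 
    \Vector{\gamma} = \Vector{\alpha} - \theta \Vector{\beta}.
\end{align*}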
%Note that while it is easy to evaluate this expression for $p < 1$, the resulting space is not metric (as it violates the triangle inequality) and $\lVert g(\theta) \rVert_p$ is rendered non-convex (see Appx. BLAH for an example).

The leakage threshold $\tau$ defines a feasible region of possible models. 
While we presume that this parameter is provided upfront (more on this in Sect.~\ref{sec:discussion}), it cannot be made arbitrarily small in the leaky IV setting. 
Specifically, the lower bound corresponds to particular values of $\theta$ and $\rho$. 
\begin{lemma}[\textit{Minimum leakage as a function of ATE}]\label{lemma:theta_star}
    %\begin{align*}
    %    \Vector{\alpha} := \SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}y}, \quad \Vector{\beta} := \SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}x}.
    %\end{align*}
    The minimum degree of leakage consistent with the data can be obtained by solving the following linear regression task in $L_p$ space:
    \begin{align*}
        \check{\theta}_p := \argmin_{\theta \in \mathbb{R}} ~g_p(\theta). 
    \end{align*}
\end{lemma}
In the special case of $p=2$, the optimum is given by the standard OLS estimator $\check{\theta}_2 = (\Vector{\beta} \DotProd \Vector{\beta})^{-1} \Vector{\beta}\DotProd \Vector{\alpha}$. Though closed-form solutions are not available for arbitrary $p$, even in the well-studied case of $p=1$ \citep{Pollard_1991, portnoy1997, chen_analysis_2008}, the value is easily computed via numerical methods. Note that $\check{\theta}_p$ is unique for any strictly convex $L_p$ norm, but the set of minimizers may form a compact interval for $p \in \{1, \infty\}$.
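To make the $p=2$ case explicit, expand the squared objective and set its derivative with respect to $\theta$ to zero:
\begin{align*}
    g_2(\theta)^2 = \Vector{\alpha} \DotProd \Vector{\alpha} - 2 \theta \, \Vector{\beta} \DotProd \Vector{\alpha} + \theta^2 \, \Vector{\beta} \DotProd \Vector{\beta} 
    \quad \Longrightarrow \quad 
    \check{\theta}_2 = \frac{\Vector{\beta} \DotProd \Vector{\alpha}}{\Vector{\beta} \DotProd \Vector{\beta}},
\end{align*}
which requires $\Vector{\beta} \neq \Vector{0}$, i.e. relevance (A1). As a hypothetical numerical illustration (values chosen for convenience only), let $\Vector{\alpha} = (1, 1/2)$ and $\Vector{\beta} = (1, 1)$. Then $\check{\theta}_2 = 3/4$, with minimum leakage $g_2(\check{\theta}_2) = \lVert (1/4, -1/4) \rVert_2 = \sqrt{2}/4 \approx 0.354$. For $p=1$, by contrast, $g_1(\theta) = |1 - \theta| + |1/2 - \theta|$ attains its minimum value of $1/2$ at every $\theta \in [1/2, 1]$, which illustrates how the minimizer can fail to be unique.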

Next, we find the corresponding value(s) of $\rho$.
\begin{lemma}[\textit{Minimum leakage as a function of confounding}]\label{lemma:rho_star}
    Define $h_p := g_p \circ f$, such that $h_p: [-1, 1] \mapsto \mathbb{R}_{\geq 0}$ maps values of $\rho$ to $\lVert \Vector \gamma \rVert_p$. 
    For any $\check{\theta}_p$ (either a unique solution or any point on the compact interval of solutions), $h_p$ achieves its minimum at:
    \begin{align*}
        \check{\rho}_p &:= \argmin_{\rho \in [-1, 1]} ~h_p(\rho) = f^{-1}(\check{\theta}_p)\\
        &= \text{sin}\Bigg(\text{arctan}\Bigg(\frac{\KappaXY - \check{\theta}_p \KappaXX}{\sqrt{\KappaXX\KappaYY - \KappaXY^2}}\Bigg)\Bigg).
    \end{align*}
\end{lemma}
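The closed form in Lemma~\ref{lemma:rho_star} can be checked by inverting $f$ (a sketch, assuming $\KappaXX\KappaYY > \KappaXY^2$). Rearranging $\theta = f(\rho)$ gives
\begin{align*}
    \tan\big(\arcsin(\rho)\big) = \frac{\KappaXY - \theta \KappaXX}{\sqrt{\KappaXX\KappaYY - \KappaXY^2}},
\end{align*}
and applying $\sin \circ \arctan$ to both sides recovers the stated expression. Using $\sin(\arctan(t)) = t / \sqrt{1 + t^2}$, the inverse may equivalently be written without trigonometric functions:
\begin{align*}
    f^{-1}(\theta) = \frac{\KappaXY - \theta \KappaXX}{\sqrt{\KappaXX \big( \KappaYY - 2 \theta \KappaXY + \theta^2 \KappaXX \big)}},
\end{align*}
where $\KappaYY - 2 \theta \KappaXY + \theta^2 \KappaXX$ is the conditional variance of $Y - \theta X$ given $\Vector{Z}$, i.e. the value of $\eta_y^2$ implied by $\theta$.
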
Lemmas~\ref{lemma:theta_star} and \ref{lemma:rho_star} provide an essential criterion for partial identification in the leaky IV model.
Define $\check{\tau}_p := g_p(\check{\theta}_p) = h_p(\check{\rho}_p)$. (Observe that this value is unique even when $\check{\theta}_p$ and $\check{\rho}_p$ are not.) 
Let $\theta^*$ denote the true ATE, with corresponding leakage weights $\Vector{\gamma}^* = \Vector{\alpha} - \theta^* \Vector{\beta}$ and oracle threshold $\tau^*_p := \lVert \Vector \gamma^* \rVert_p$, so named because it quantifies the precise (and unidentifiable) amount of information leakage from $\Vector{Z}$ to $Y$ in the true data generating process. These minimum and oracle thresholds fully characterize identifiability conditions in the leaky IV model.

\begin{theorem}[\textit{Identifiability}]\label{thm:id}
    Assume that Eqs.~\ref{eq:scmx}, \ref{eq:scmy}, and \ref{eq:scmSigma} hold, along with assumptions (A1), (A2), and (A$3'_s$) for some $p \geq 1$. 
    Then ATE bounds are:
    \begin{itemize}[noitemsep]
        \item undefined for all $\tau < \check{\tau}_p$;
        \item identifiable but invalid for all $\tau \in [\check{\tau}_p, \tau^*_p)$; and
        \item identifiable and valid for all $\tau \geq \tau^*_p$.
    \end{itemize}
    Moreover, the true ATE is identifiable iff $\tau^*_p = \check{\tau}_p$ and $g_p$ attains a unique minimum, in which case $\theta^* = \check{\theta}_p$.
\end{theorem}
The three-partition of threshold space implied by Thm.~\ref{thm:id} is visualized in Fig.~\ref{fig:id} for $p=2$, where we see how bounds go from nonexistent (grey striped region) to narrow but erroneous (red shaded region), only becoming valid at or above the oracle threshold $\tau^*_2$. 
Note that the perpendicular lines $\lVert \Vector \gamma \rVert_p = \tau^*_p$ and $\theta = \theta^*$ intersect at a point on the leakage curve. 
This illustrates that valid bounds in the leaky IV model are not generally symmetric about $\theta^*$. In fact, for $p \in (1, \infty)$ and $\tau^*_p > \check{\tau}_p$, the true ATE $\theta^*$ will coincide with one extremum of the partial identification interval at $\tau = \tau^*_p$.
%When $\tau < \check{\tau}_p$, the feasible region is empty and no configuration of parameters can satisfy our structural constraints. When $\tau = \check{\tau}_p$, upper and lower bounds collapse to a single point and the ATE is fully identified as $\check{\theta}_p$. Only when $\tau > \check{\tau}_p$ do we get a partial identification interval with nonidentical extrema, which must lie on either side of $\check{\theta}_p$.

The identifiability conditions of Thm. \ref{thm:id} have an immediate consequence for the classic linear IV model, which is a special case of our leaky IV model with $\tau=0$.
\begin{corollary}\label{corollary:testable}
    Under the assumptions of Thm. \ref{thm:id}, the exclusion criterion (A3) holds iff $\tau^*_p = \check{\tau}_p = 0$, in which case $\theta^* = \check{\theta}_2 = \theta^{\text{2SLS}}$.
\end{corollary}
As we will see in the sequel, this constraint has falsifiable consequences that can motivate the use of the leaky IV approach in practice. 

\begin{figure}[t]
  \centering
  \includegraphics[width=0.5428571\columnwidth]{figures/3part.pdf}
  \vspace{-2mm}
  \caption{Minimum and oracle leakage values impose a three-partition of the threshold space. Below $\check{\tau}_2$, we have \textit{the infeasible region} (grey striped area), where no configuration of latent parameters satisfies our structural constraints. Between $\check{\tau}_2$ and $\tau^*_2$, we have \textit{the error region} (red area), where bounds are identifiable but invalid. Above $\tau^*_2$, we have \textit{the valid region} (rest of the plot), where bounds are guaranteed to contain the true ATE $\theta^*$, represented by the vertical blue line.}
  \vspace{-2mm}
  \label{fig:id}
\end{figure}

We now reformulate our optimization task:
%Together, these lemmas provide a direct path toward solving our optimization problem, which can now be reformulated as follows:
\begin{align*}
    \underset{\rho \in [-1,1]} {\text{min/max}} \quad &\theta \quad \text{s.t.} \quad h_p(\rho) = \tau. 
\end{align*}
This is a straightforward one-dimensional objective where all structural constraints have been absorbed into a single function $h_p$. 
We now have all the ingredients in place to state our main result.
\begin{theorem}[\textit{ATE bounds}]\label{thm:ate_bounds}
    Assume the conditions of Thm. \ref{thm:id} hold for some $\tau \geq \tau^*_p$. 
    %Assume Eqs. \ref{eq:scmx}, \ref{eq:scmy}, and \ref{eq:scmSigma} and assumptions (A1), (A2), and (A$3'_s$) hold for some $p \geq 1$ and $\tau \geq \tau^*_p$. 
    Then for any $\check{\rho}_p$ (either a unique solution or any point on the compact interval of solutions), there exist unique min/max values of the confounding coefficient consistent with the posited information leakage:
    \begin{alignat*}{2}
        \rho^-_{\tau, p} &:= \min_{\rho \in [-1, \check{\rho}_p]} \quad &&\rho \quad \text{s.t.} \quad h_p(\rho) = \tau\\
        \rho^+_{\tau, p} &:= \max_{\rho \in [\check{\rho}_p, 1]} \quad &&\rho \quad \text{s.t.} \quad h_p(\rho) = \tau.
    \end{alignat*}
    Plugging these values into $f$ produces valid and sharp ATE bounds:
    \begin{align*}
        \theta^-_{\tau, p} = f(\rho^+_{\tau, p}), \quad 
        \theta^+_{\tau, p} = f(\rho^-_{\tau, p}).
    \end{align*}
\end{theorem}
Analytic solutions are generally intractable for $p \neq 2$. 
However, Thm. \ref{thm:ate_bounds} guarantees the existence and uniqueness of valid, sharp ATE bounds in the leaky IV model for any $p \geq 1$. These values can be readily computed with numerical methods, e.g. linear programming techniques \citep{linear_opt}. 
For the $L_2$ case, we derive the following solution in closed form.
\begin{corollary}\label{corollary:l2}
    Under the assumptions of Thm. \ref{thm:ate_bounds} with $p=2$, min/max ATE values are given by: 
    \begin{align*}
        \check{\theta}_2 \pm ({\Vector{\beta}\DotProd \Vector{\beta}})^{-1} \sqrt{(\Vector{\beta}\DotProd \Vector{\beta}) ~(\tau^2- \Vector{\alpha}\DotProd \Vector{\alpha}) + (\Vector{\alpha}\DotProd \Vector{\beta})^2 }.
\end{align*}
\end{corollary}
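As a sanity check on Corollary~\ref{corollary:l2}, consider a toy instance with $d_{\Vector Z} = 2$. The values below are hypothetical, chosen only to make the arithmetic transparent; they do not come from our experiments. Take $\Vector{\alpha} = (1, 1)$, $\Vector{\beta} = (1, 0)$, and $\tau = \sqrt{2}$, so that $\Vector{\beta}\DotProd \Vector{\beta} = 1$, $\Vector{\alpha}\DotProd \Vector{\beta} = 1$, and $\Vector{\alpha}\DotProd \Vector{\alpha} = 2$. Assuming the interval is centered at the least squares solution $\check{\theta}_2 = (\Vector{\alpha}\DotProd \Vector{\beta}) / (\Vector{\beta}\DotProd \Vector{\beta}) = 1$, the closed form gives:
\begin{align*}
    \check{\theta}_2 \pm (1)^{-1} \sqrt{1 \cdot (2 - 2) + 1^2} = 1 \pm 1,
\end{align*}
i.e. ATE bounds of $[0, 2]$. Both endpoints sit exactly on the leakage constraint: $\theta = 0$ yields $\Vector{\gamma} = (1, 1)$ and $\theta = 2$ yields $\Vector{\gamma} = (-1, 1)$, each with $\lVert \Vector{\gamma} \rVert_2 = \sqrt{2} = \tau$. Shrinking the threshold to $\tau = 1$ sets the radicand to $1 \cdot (1 - 2) + 1^2 = 0$, so the interval collapses to the single point $\check{\theta}_2 = 1$. Any smaller $\tau$ makes the radicand negative and the bounds undefined, in line with the first case of Thm.~\ref{thm:id}.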

\subsection{Vector $\tau$-exclusion}\label{sec:vector_tau}
Our scalar $\tau$-exclusion criterion is somewhat crude, as it applies a single threshold on a summary statistic of all $\Vector{\gamma}$ weights. 
% Using a single scalar $\tau$ reflects somewhat a crude background knowledge about the relationship between the variables $\Vector{\gamma}$ and the outcome $Y$, as it applies a single threshold on a summary statistic of all $\Vector{\gamma}$ weights. 
%The scalar $\tau$ used by our method applies a single threshold on a summary statistic of all $\Vector{\gamma}$ weights, reflecting a somewhat crude background knowledge about the relationship between the variables $\Vector{\gamma}$ and the outcome $Y$. 
In many cases, however, background knowledge may license a more fine-grained approach that applies separate thresholds either to individual candidate instruments or groups thereof. 
For instance, in a Mendelian randomization study, we may partition SNPs by chromosome, exploiting biological knowledge to permit more or less leakage as we move across the genome. 
Alternatively, we may impose the restriction that our $\Vector{Z}$ variables should be more ``relevant'' than ``leaky'', with each $\beta_j$ coefficient exceeding the corresponding $\gamma_j$ in absolute value.

These considerations inspire a more heterogeneous relaxation of the exclusion criterion characterized by (A$3'_v$), which we refer to as \textit{vector $\tau$-exclusion}. 
%Placing thresholds on the absolute value of each $\gamma$ coefficient creates a hyperrectangular feasible region. With fixed covariance parameters, the vector $\bm \gamma$ is constrained to the hyperplane $\bm{a} - \bm{b} \theta$. Partial identification is possible iff this hyperplane intersects the feasible region. Our optimization objective is to find the min/max values of $\theta$ that locate $\bm \gamma$ within this intersection.
Fortunately, the solution in this case follows naturally from our previous analysis. 
Suppose that all thresholds are strictly positive, i.e. that $\lVert \Vector \tau \rVert_0 = d_{\Vector Z}$. 
Then we simply perform a linear transformation of all candidate instruments, scaling them by their respective leakage thresholds to create modified variables $\tilde{Z}_j := Z_j / \tau_j$. 
Let $\Vector{\tau}^+ := [1, 1, \Vector{\tau}]$ denote an augmented threshold vector of length $2+d_{\Vector{Z}}$, with dummy entries for $X$ and $Y$. 
Then, we define a square transformation matrix $\Matrix{T}$ with entries $\Matrix{T}_{ij} := 1 / (\tau^+_i \tau^+_j)$ and update our covariance parameters:
\begin{align*}
    \Tilde{\SSigma} := \Matrix{T} \odot \SSigma,
\end{align*}
where $\odot$ denotes the Hadamard product (i.e., entrywise multiplication). 
Plugging this matrix into the equations of Sect.~\ref{sec:scalar_tau} produces transformed linear weights $\tilde{\Vector{\alpha}},\tilde{\Vector{\beta}}, \tilde{\Vector{\gamma}}$. 
Vector $\tau$-exclusion can now be rewritten as:
\begin{itemize}[itemindent=2em]
    \item[(A$3'_{v2}$)] \textit{Vector $\tau$-exclusion:} $\lVert \tilde{\Vector{\gamma}}\rVert_\infty \leq 1$.
\end{itemize}
%Whereas the feasible region originally corresponded to a potentially unruly hyperrectangle in $\Vector{\gamma}$ space, our linear transformation has reduced it to the unit hypercube in a new space $\tilde{\Vector{\gamma}}$.
All previous results go through just the same, including identifiability conditions (Thm. \ref{thm:id}) and optimal ATE bounds (Thm.~\ref{thm:ate_bounds}). 
%Closed form solutions are not available for $L_\infty$ optimization, but the problem can be formulated as a linear program and efficiently solved using standard software. 

This strategy will need to be modified if we wish to impose $\tau_j = 0$ for some $j \in [d_{\Vector Z}]$---i.e., to treat some variable(s) as valid IVs that satisfy the classical exclusion criterion. 
Let $\Vector{S}_0 \subset [d_{\Vector Z}]$ pick out all and only those features such that $\tau_j=0$, with complementary subset $\Vector{S}_1 := [d_{\Vector Z}] \setminus \Vector{S}_0$.
Then for each $j \in \Vector{S}_0$, we set the corresponding entry in $\Vector{\tau}^+$ to 1 in our construction of the transformation matrix $\Matrix{T}$ to avoid division by zero, and update the formula for $\tilde{\Vector \gamma}$ to reflect the reduced degrees of freedom:
\begin{align*}
    \tilde{\gamma}_j =
    \begin{cases}
        0, & \text{if}~j \in \bm{S}_0;\\
        \tilde{\alpha}_j - \theta \tilde{\beta}_j, & \text{otherwise},
    \end{cases}
\end{align*}
for all $j \in [d_{\Vector Z}]$.
We modify the definitions of $g_p, h_p$, and $\tau^*_p$ to restrict their range to just those $j \in \Vector{S}_1$.
Now (A$3'_{v2}$) applies and produces sharp bounds, as desired.

\subsection{Inference}\label{sec:cvg}
Thm.~\ref{thm:ate_bounds} provides an exact solution with the population covariance matrix $\SSigma$. 
In practice, of course, all parameters must be estimated from finite data.
We generally take the sample covariance matrix $\hat{\SSigma}$ as our plug-in estimator, but many alternatives are possible. 
Numerous Bayesian \citep[Ch.~3.6]{leonard1992, daniels1999, gelman_bayesian} and penalized likelihood \citep{Schafer2005, warton2008, won_condition-number-regularized_2012} methods have been proposed for this task, or the closely related task of estimating a regularized precision matrix \citep{Friedman2007, cai2011, Mazumder2012}. 
Several of these options are implemented in our accompanying software package. 
These alternatives may be especially attractive in high-dimensional settings with a large number of leaky IVs, where regularization helps to ensure a positive definite $\hat{\SSigma}$.

We use a variety of estimators in our experiments below and augment the procedure with inference techniques. First, we describe a parametric test of the exclusion criterion, as foreshadowed by Corollary \ref{corollary:testable}. 
In the linear IV model with $d_{\Vector Z} \geq 2$, it is well known \citep{kuroki2005, Chen_Tian_Pearl_2014, silva_shimizu_2017} that exclusion imposes a set of so-called \textit{tetrad constraints} on covariance parameters of the form: 
\begin{align*}
    \SSigma_{z_j y} \SSigma_{z_k x} - \SSigma_{z_j x} \SSigma_{z_k y} = 0.
\end{align*}
If this holds for all distinct pairs of candidate instruments $j,k \in [d_{\Vector Z}]$, then the vectors $\Vector \alpha, \Vector \beta$ are parallel and the minimum leakage $\check{\tau}_p$ is zero. 
Define the $d_{\Vector Z} \times 2$ matrix $\Vector \Lambda := [\SSigma_{\Vector{z}x}, \SSigma_{\Vector{z}y}]$. Then our test statistic is $\psi := \det(\Vector \Lambda \DotProd \Vector \Lambda)$ and our null hypothesis is $H_0: \psi = 0$, which is necessary and sufficient for $\check{\tau}_p=0$.
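
To see why $\psi$ summarizes all tetrad constraints at once, it may help to expand the determinant. The following is a sketch, assuming that $\Vector \Lambda \DotProd \Vector \Lambda$ denotes the $2 \times 2$ Gram matrix of the columns of $\Vector \Lambda$. By Lagrange's identity:
\begin{align*}
    \psi &= (\SSigma_{\Vector{z}x} \DotProd \SSigma_{\Vector{z}x}) ~(\SSigma_{\Vector{z}y} \DotProd \SSigma_{\Vector{z}y}) - (\SSigma_{\Vector{z}x} \DotProd \SSigma_{\Vector{z}y})^2\\
    &= \sum_{j < k} \big( \SSigma_{z_j y} \SSigma_{z_k x} - \SSigma_{z_j x} \SSigma_{z_k y} \big)^2.
\end{align*}
Hence $\psi \geq 0$, with equality iff every pairwise tetrad difference vanishes. This also suggests that evidence against $H_0$ should be sought only in large values of the statistic, i.e. that a one-sided test is appropriate.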

We propose to test $H_0$ via Monte Carlo, estimating $\hat{\theta}^{\text{2SLS}}$ on the original data and creating a null covariance matrix $\SSigma^0$ by replacing $\hat{\SSigma}_{\Vector{z}y}$ with $\SSigma_{\Vector{z}y}^0 := \hat{\SSigma}_{\Vector{z}x} \hat{\theta}^{\text{2SLS}}$. 
We assume that samples are distributed according to some $P_{\SSigma} \in \mathcal{P}$, where $\mathcal{P}$ denotes a family of distributions parameterized by a covariance matrix $\SSigma$---obvious examples include multivariate Gaussian and $t$-distributions, although alternatives such as multivariate binomial or Poisson distributions are also viable \citep{krummenauer1998, jiang2021}. 
So long as we can sample data under fixed values of $\SSigma$, we can perform the following test.
\begin{theorem}[\textit{Exclusion test}]\label{thm:test}
    Let $\mathcal{D}_n = \{x_i, y_i, \Vector{z}_i\}_{i=1}^n$ be a dataset generated according to the conditions of Thm. \ref{thm:id}, with $\mathcal{D}_n \sim P_{\SSigma}$, $d_{\Vector{Z}} \geq 2$, and sample estimate $\hat{\psi}_n$.
    Construct a null covariance matrix $\SSigma^0$ as detailed above. Draw $B$ synthetic datasets of size $n$, $\mathcal{D}^0_{n, (b)} \sim P_{\SSigma^0}$, and record the test statistic $\psi^0_{n, (b)}$ for all $b \in [B]$.
    Then as $n, B \rightarrow \infty$, the following is an asymptotically valid $p$-value against $H_0$:
    \begin{align*}
        p_{\textup{MC}} = \frac{\# \big\{b: \psi^0_{n, (b)} \geq \hat{\psi}_n \big\} + 1}{B+1}.
    \end{align*}
\end{theorem}
Thm. \ref{thm:test} describes a frequentist method for testing the exclusion criterion in linear IV models. Sufficiently small values of $p_{\text{MC}}$ can motivate a leaky approach, as 2SLS results in biased ATE estimates when (A3) fails. Note that $\check{\tau}_p=0$ is a necessary but insufficient condition for exclusion, which additionally requires that $\tau^*_p=0$. A minimum possible leakage of zero provides no evidence that the true leakage is in fact zero. 
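
Two short calculations may help to make the procedure concrete. First, the null matrix satisfies every tetrad constraint by construction, since substituting $\SSigma_{\Vector{z}y}^0 = \hat{\SSigma}_{\Vector{z}x} \hat{\theta}^{\text{2SLS}}$ gives, for any pair $j, k$:
\begin{align*}
    \SSigma^0_{z_j y} \hat{\SSigma}_{z_k x} - \hat{\SSigma}_{z_j x} \SSigma^0_{z_k y} = \hat{\theta}^{\text{2SLS}} \big( \hat{\SSigma}_{z_j x} \hat{\SSigma}_{z_k x} - \hat{\SSigma}_{z_j x} \hat{\SSigma}_{z_k x} \big) = 0.
\end{align*}
Second, as a purely illustrative example of the $p$-value arithmetic (the counts here are hypothetical), suppose $B = 2000$ and that $199$ of the synthetic statistics are at least as large as $\hat{\psi}_n$. Then:
\begin{align*}
    p_{\textup{MC}} = \frac{199 + 1}{2000 + 1} \approx 0.09995,
\end{align*}
which would lead to rejection at level $0.1$, whereas a count of $200$ gives $201 / 2001 \approx 0.10045$ and no rejection. The smallest attainable value is $1 / (B+1)$, so $B$ bounds the resolution of the test from below.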

Next, we introduce a nonparametric bootstrapping procedure \citep{efron_bootstrap} to quantify the uncertainty of ATE bounds. 
Specifically, we draw $B$ many datasets of size $n$ by sampling with replacement from the input data and estimate the covariance matrix $\hat{\SSigma}_{(b)}$ for each $b \in [B]$.
Our target parameters are $(\theta^-, \theta^+)_{\tau, p}^*$, i.e. the bounds we would expect from a \textit{partial identification oracle} (henceforth a $\SSigma$-oracle) with knowledge of the population covariance matrix for observed variables $\{X, Y, \Vector{Z}\}$. 
Note that even with access to the true $\SSigma$, $\tau^*_p$ and $\theta^*$ remain unidentifiable, and so we distinguish between $\SSigma$-oracles and oracles \textit{tout court}, who are additionally omniscient with respect to latent parameters and therefore able to point identify the ATE.
With $\tau \geq \tau^*_p$, a $\SSigma$-oracle will produce valid bounds. By contrast, with $\tau \in [\check{\tau}_p, \tau^*_p)$, a $\SSigma$-oracle will produce invalid bounds that lie in the error region of Fig. \ref{fig:id} (see Thm. \ref{thm:id}).

Since $\check{\tau}_p$ depends on the data, it is possible that some bootstraps may violate the partial identifiability criterion $\tau \geq \check{\tau}_p$, especially if the sample size is small and/or the selected threshold is close to the true leakage minimum as determined by a $\SSigma$-oracle.
%$\check{\tau}^*_p$ (computed with the population covariance matrix). 
Our estimator is undefined when the feasible region is empty, so we discard any offending bootstraps. As $\check{\tau}_p$ can be estimated on the full dataset upfront, this issue may be mitigated by selecting a threshold sufficiently high above this value. The procedure comes with the following coverage guarantee.
\begin{theorem}[\textit{Coverage}]\label{thm:cvg}
    Let $\mathcal{D}_n = \{x_i, y_i, \Vector{z}_i\}_{i=1}^n$ be a dataset generated according to the conditions of Thm. \ref{thm:id}. Draw $B$ bootstrap samples of size $n$ with replacement from $\mathcal{D}_n$, subject to $\tau \geq \check{\tau}_{p, (b)}$ for all $b \in [B]$. For a given $\diamond \in \{-, +\}$ and level $\alpha \in (0,1)$, we construct the confidence interval $\hat{C}_n = [\hat{q}_l, \hat{q}_u]$ as follows. 
    Let $\hat{q}_l$ be the $l$th smallest value of the bootstrap distribution for $\hat{\theta}^{\diamond}_{\tau, p}$, with $l = \lceil (B+1)(\alpha / 2)  \rceil$. 
    Let $\hat{q}_u$ be the $u$th smallest value of the same set, with $u = \lceil (B+1)(1 - \alpha / 2) \rceil$.
    Then as $n, B \rightarrow \infty$, we have:
    \begin{align*}
        \mathbb{P}\big(\theta^{\diamond *}_{\tau, p} \in \hat{C}_n\big) \geq 1 - \alpha.
    \end{align*}
\end{theorem}
We can smooth the bootstraps with kernel density estimation (see Sect. \ref{sec:experiments}) or use a Bayesian bootstrap to get an approximate posterior distribution \citep{rubin1981}. 
Either way, Thm. \ref{thm:cvg} provides a template for testing claims about whether, for instance, zero lies above or below the partial identification interval with high probability.
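
To illustrate the order statistics in Thm.~\ref{thm:cvg}, take $B = 2000$ retained bootstraps and level $\alpha = 0.1$. These values are picked for illustration, and the calculation assumes that no bootstraps were discarded. The indices are then:
\begin{align*}
    l &= \lceil 2001 \times 0.05 \rceil = \lceil 100.05 \rceil = 101,\\
    u &= \lceil 2001 \times 0.95 \rceil = \lceil 1900.95 \rceil = 1901,
\end{align*}
so $\hat{C}_n$ runs from the 101st to the 1901st smallest bootstrap estimate of $\hat{\theta}^{\diamond}_{\tau, p}$. If some bootstraps are discarded for violating $\tau \geq \check{\tau}_{p, (b)}$, the same formulas apply with $B$ replaced by the number of retained draws.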
%Bootstrapping provides a sampling distribution for ATE bounds, allowing us to test whether, for instance, zero lies above or below the partial identification interval with high probability.
%We can smooth the bootstrap distribution via kernel density estimation or use a Bayesian bootstrap to get an approximate posterior distribution \citep{rubin1981}. 
%We can also invert confidence intervals to compute $p$-values for frequentist inference. 
%For more details on bootstrap hypothesis testing, see \citep{davison_hinkley}.


\section{Experiments}\label{sec:experiments}

% Full details of all simulation experiments are provided in Appx.~\ref{appx:exp}. Unless stated otherwise, we use $p=2$ throughout.
For full details of all simulation experiments, see Appx.~\ref{appx:exp}. Code for reproducing all results and figures can be found on our dedicated GitHub repository.\footnote{\url{https://github.com/dswatson/leakyIV}.} In this section, we use $p=2$ throughout and estimate all covariance parameters via maximum likelihood.
%For results with $p=1$, see Appx. \ref{appx:more_exp}. 

\begin{figure}[t]
  \centering
  \includegraphics[width=0.95\columnwidth]{figures/benchmark_Toeplitz_SNRx=2.pdf}
  \vspace{-2mm}
  \caption{Comparison against various methods at a range of values for the confounding coefficient $\rho$, SNR for $Y$, and number of candidate instruments $d_{\Vector{Z}}$. 
  The horizontal black line at $\theta = 1$ represents the true ATE $\theta^*$.}
  \vspace{-2mm}
  \label{fig:benchmarks}
\end{figure}

For our benchmark experiments, we generate data according to Eqs.~\ref{eq:scmx}-\ref{eq:scmSigma} with the following process:
\begin{equation*}
\begin{gathered}
    \Vector{Z} \sim \mathcal{N}(0, \SSigma_{\Vector{z}\Vector{z}}) \quad
    \Vector{\beta} \sim \mathcal{N}(0, 1) \\
    \Vector{\gamma} \sim \mathcal{N}(0, 1) \times \zeta \quad
    \epsilon_x, \epsilon_y \sim \mathcal{N}(0, \SSigma_{\Vector{\epsilon}\Vector{\epsilon}}),
\end{gathered}
\end{equation*}
and fixed $\SSigma_{yy} = 10$.
The scaling factor $\zeta$ and residual variance parameters $\eta^2_x, \eta^2_y$ are chosen to ensure that the signal-to-noise ratio (SNR) of each of Eqs.~\ref{eq:scmx} and \ref{eq:scmy} is fixed at the desired level (for details, see Appx.~\ref{appx:snr}). 
We simulate data from the leaky IV model under a range of hyperparameters:
\begin{itemize}[noitemsep]
    \item Dimensionality $d_{\Vector{Z}}$ is selected from $\{5, 10\}$.
    \item The covariance matrix $\SSigma_{\Vector{zz}}$ is either diagonal or Toeplitz with autocorrelation $0.5$. In either case, we set marginal variance to $1 / d_{\Vector Z}$ for each $Z$.
    \item The confounding coefficient $\rho$ is selected from $\{-0.9, -0.8, \dots, 0.9\}$.
    \item The SNR for $X$ is selected from $\{0.5, 1, 2\}$.
    \item The SNR for $Y$ is selected from $\{0.5, 1, 2\}$.
\end{itemize}
Taking the Cartesian product of all these hyperparameters generates a grid of $2 \times 2 \times 19 \times 3 \times 3 = 684$ unique simulation configurations. We hold the sparsity of $\Vector \gamma$ fixed at $0.2$ and set the true ATE $\theta^*$ to 1 across all experiments.

%Specifically, we set the SNR for $X$ to $2$ to ensure a strong signal from $\Vector{Z}$, and range the SNR for $Y$ from $\sfrac{1}{2}$ to $2$. We additionally vary the confounding coefficient $\rho \in [-0.9, 0.9]$ and number of candidate instruments $d_{\Vector{Z}} \in \{5, 10\}$. 
%We hold the sparsity of $\Vector{\gamma}$ fixed at $0.2$ and use a diagonal covariance matrix $\SSigma_{\Vector{z}\Vector{z}}$. We set the true ATE $\theta^*$ to $1$ for all simulation settings. 

%(a) the signal-to-noise ratio (SNR) of Eqs.~\ref{eq:scmx} and \ref{eq:scmy} are fixed at the desired level (see below); and (b) $\SSigma_{xx} = \SSigma_{yy} = 1$. 
%This latter constraint helps with interpretation, as it means that $\theta^2$ denotes the proportion of $Y$'s variance explained by $X$ given $\Vector{Z}$.

\paragraph{Point Estimators.}
We present the mean and standard deviation of ATE estimates for a range of methods, computed across 50 runs with $n=1000$ samples each. For our own approach, LeakyIV, we set $\tau = 1.1 \tau^*_2$ and shade the interval between our mean estimates for $(\hat{\theta}^-, \hat{\theta}^+)_{\tau, 2}$.
We benchmark against two classic methods, the backdoor adjustment and 2SLS, as illustrative baselines. We also compare our results to two methods designed for causal inference with some invalid instruments: 
sisVIVE, which performs implicit feature selection via an $L_1$ penalty on the candidate IVs \citep{kang2016}; and 
mode-based estimation (MBE), which treats invalid instruments as outliers that can be ignored using robust inference techniques \citep{Hartwig2017}.
%and MASSIVE, a Bayesian model averaging approach designed for high-dimensional instruments \citep{bucur2020}. 
We refer readers to the original papers for details on each. 
%Together, these methods represent a diverse array of solutions based on different but overlapping assumptions.


%A more general overview of our performance against comparative benchmarks is provided in Fig. BLAH. We see here that our method captures the true ATE in BLAH percent of simulations. Meanwhile, performance across our benchmarks varies widely. 2SLS fares surprisingly well, given that no simulation considered here satisfies its assumption that all instruments are valid. sisVIVE performs very similarly throughout, even in settings where the true $\bm{\gamma}$ vector is sparse and it should have a comparative advantage. MBE is more erratic. It appears to be a decreasing function of $\rho$, vastly overestimating causal effects at $\rho = -0.75$ and underestimating them at $\rho = 0.75$. 

%We benchmark against three algorithms designed for causal inference with some invalid instruments: sisVIVE \citep{kang2016}, MBE \citep{Hartwig2017}, and Ivy \citep{kuang2020}. These represent a diverse array of solutions based on vastly different assumptions and methods.

%A key assumption of these algorithms is that some or most of the candidate IVs are \emph{valid}, i.e. satisfy (A1)-(A3). We make no such assumption, permitting all our soft instruments to have a direct effect on $Y$ so long as the cumulative signal is bounded by $\tau$. 

Results for $\SSigma_{\Vector{zz}}$ = Toeplitz, $\text{SNR}_X = 2$ are presented in Fig. \ref{fig:benchmarks}. (Results are broadly similar for alternative simulations; see  Appx. \ref{appx:more_exp}.) 
%The jagged lines are due to the fact that we simulate new values for $\Vector{\beta}$ and $\Vector{\gamma}$ at each step.
We find that the backdoor adjustment is systematically biased downward for $\rho < 0$ and upward for $\rho > 0$, exactly as theory predicts. In most cases, confounding effects on either end of the $x$-axis are sufficiently strong to send the curve beyond the limits of our estimated partial identification interval. Alternative methods designed for the IV setting fare better, but still behave somewhat erratically. MBE in particular appears prone to occasional bursts of uncertainty, especially under extreme confounding. 

By contrast, our bounds contain the true ATE in 683 out of 684 settings, or 99.85\% of the time. Moreover, they are generally informative, capturing the true direction of causal effects in over half of all trials despite a relatively weak signal from the exposure $X$. Our bounds are clearly \textit{correlated} with results from IV point estimators, but whereas competitors tend to overstate their confidence---bouncing between positive and negative causal effects multiple times in each panel---our bounds almost never stray so far as to miss the true ATE.

%Interestingly, sisVIVE and MBE are almost indistinguishable from 2SLS in these experiments. All three methods hover around the true ATE and almost always fall within our estimated interval. However, they are prone to random and occasionally sizeable errors, for instance estimating $\hat{\theta} < -2.5$ in the high SNR regime with $d_{\Vector Z} = 5$ and $\rho = 0.5$. This setting also drives LeakyIV's estimated minimum down to its lowest point, but our maximum remains above the true value of $\theta^*=1$. 

\begin{figure}[t]
  \centering
  \includegraphics[width=0.925\columnwidth]{figures/bayesian.pdf}
  \vspace{-2mm}
  \caption{Comparison against a Bayesian model at a range of values for the confounding coefficient $\rho$. Histograms represent 2000 samples from a posterior distribution estimated via MCMC. The solid blue line denotes the true ATE, while the dashed red lines indicate LeakyIV bounds.}
  \vspace{-2mm}
  \label{fig:bayesian}
\end{figure}

\paragraph{Bayesian Methods.}
An alternative family of methods for modeling latent parameters in the IV setting is based on Bayesian inference \citep{Shapland2019, bucur2020, Gkatzionis2021}. The goal in this approach is to estimate a posterior distribution for $\theta$, with partial identification bounds given by the upper and lower $\alpha$-quantiles of the credible interval. Rather than compare against some off-the-shelf method that does not explicitly encode our $\tau$-exclusion criterion, we design a Markov chain Monte Carlo (MCMC) sampler to model causal effects in the leaky IV setting (for details, see Appx.~\ref{appx:bayes}). Due to the computational demands of MCMC sampling, we focus on the case where $d_{\Vector{Z}}=5$, $\SSigma_{\Vector{zz}}$ is diagonal, and the SNR for both $X$ and $Y$ is 2, varying only $\rho$ across a range of six possible values. 

Results are presented in Fig. \ref{fig:bayesian}, featuring 2000 draws from the estimated posterior distribution for $\theta$. The blue line denotes the true ATE $\theta^*=1$, while red dashed lines indicate bounds estimated by LeakyIV. We observe that even with uninformative priors on all linear parameters and just $n=1000$ observations, the posterior tends to concentrate around a biased estimate, occasionally even placing some of the density outside our partial identification interval. 
Bayesian methods struggle in this setting because every solution in the feasible region has the same likelihood, which makes posteriors especially sensitive to the choice of prior distribution. 
Moreover, these methods do not decouple bounds on the causal parameter from the causal parameter itself. Any claims that posterior quantiles can be interpreted as ``bounds'' are either (i) a direct consequence of the prior; or (ii) heuristics that may be impossible to interpret outside the infinite data limit with an uninformative prior. Of course, this defeats the purpose of having priors on parameters in the first place.
%MCMC sampling can be inefficient and misleading in the leaky IV model.

\begin{figure}[t]
  \centering
  \includegraphics[width=0.95\columnwidth]{figures/power.pdf}
  \vspace{-2mm}
  \caption{Power curves for the Monte Carlo exclusion test at varying values of the confounding coefficient $\rho$ and sample size $n$. Shading denotes standard errors. The horizontal dashed red line denotes the target level $\alpha = 0.1$.}
  \vspace{-2mm}
  \label{fig:pwr}
\end{figure}

\paragraph{Power.}
We run a series of power simulations to evaluate the sensitivity of our Monte Carlo exclusion test. With $d_{\Vector Z}=5, \theta^*=1$, diagonal $\SSigma_{\Vector{zz}}$, and fixed $\text{SNR} = 2$ for both $X$ and $Y$, we vary the sample size $n \in \{500, 1000, 2000\}$ and effect size $\check{\tau}_2 \in \{0, 0.1, \dots, 1\}$ under six values of confounding $\rho$. We compute $p$-values using $B=2000$ replicates and reject $H_0$ at level $\alpha = 0.1$. Empirical rejection rates are recorded over 500 runs (see Fig. \ref{fig:pwr}). We find that type I error is controlled at the target level across all simulations, while power steadily increases with greater effect size, as expected. At $n=2000$, we attain 95\% power in all settings.

\paragraph{Coverage.}
Under the simulation settings of the Bayesian benchmark experiment, we evaluate empirical coverage using three bootstrap variants: the standard empirical distribution, a smoothed kernel estimate, and a Gaussian approximation. We generate 500 unique datasets for each setting and run 2000 bootstraps with fixed level $\alpha = 0.1$. Results are presented in Fig. \ref{fig:cvg}. 
Empirical coverage is very close to the nominal 90\% target in all settings, with a minimum of 88.6\%. The target level is always within a standard error of the mean across all trials.
Though performance is similar for all three estimators, the Gaussian approximation appears slightly more conservative on average. 

\section{Related Work}\label{sec:related}
%Much of the classic work on IV models assumes a binary setting with just a single instrument, i.e. $X, Y, Z \in \{0,1\}^3$ \citep{angrist1996, hernan_instruments_2006, Baiocchi2014}. Early work on ATE bounding \citep{Balke1997, manski2000, swanson2018}.

Violations of the exclusion restriction are well-documented in genetics \citep{Hemani2018} and econometrics \citep{Berkowitz2008}. 
One strategy for estimating causal effects in such settings is to permit a large number of potentially invalid instruments under the assumption that their average bias will tend to zero in the limit \citep{Bowden2015,Kolesar2015}. 
With weak monotonicity constraints, these approaches can also provide nonparametric bounds on local average treatment effects \citep{Flores2013}.

\begin{figure}[t]
  \centering
  \includegraphics[width=0.95\columnwidth]{figures/coverage.pdf}
  \vspace{-2mm}
  \caption{Empirical coverage of LeakyIV using three bootstrap estimators. Whiskers denote standard errors. The horizontal dashed red line denotes the nominal target of 90\%.}
  \vspace{-2mm}
  \label{fig:cvg}
\end{figure}

Another family of methods starts from the assumption that some proportion of candidate instruments are valid and uses statistical procedures to focus on just those variables that satisfy (A1)-(A3). 
This can be achieved, for instance, via goodness of fit tests \citep{chu_semi_instrumental}; $L_1$-penalized regression for feature selection \citep{kang2016,Guo2018,Windmeijer2019}; independence tests for collider bias \citep{kang2020}; or modal validity assumptions in linear \citep{Hartwig2017} and nonlinear \citep{hartford2021} IV models. 
Alternatively, data from multiple instruments can be pooled into a single variable using dimensionality reduction techniques \citep{kuang2020}.
% \begin{figure}
%   \centering
%   \includegraphics[width=\linewidth]{figures/3d_causal.png}
%   \caption{ Feasible area for $\theta$ is defined by $\tau$}
%   \label{fig:alpha_rho}
% \end{figure}\label{fig:alpha_3d}

There is a substantial literature on Bayesian approaches to causal inference in IV settings. \citet{Lenkoski2014} use Bayesian model averaging to select IVs based on the strength of their association with the treatment, as codified by the relevance criterion (A1). \citet{Shapland2019} extend this method to account for linkage disequilibrium in Mendelian randomization experiments, a phenomenon that violates the no confounding condition (A2). 
More recently, several authors have proposed spike-and-slab priors to select genetic variants in the face of horizontal pleiotropy \citep{bucur2020,Gkatzionis2021}, thereby addressing (A3). 
%These Bayesian approaches typically rely on strong parametric assumptions and/or intensive computation, but they arguably do a better job than their frequentist counterparts of reflecting the inherent uncertainty of model selection and parameter estimation. 

\citet{Conley2012} propose ATE bounding methods given various kinds of prior information on leaky coefficients $\Vector \gamma$, including a range restriction and a (non-uniform) prior distribution. For the former, they return a union of confidence intervals resulting from a grid of possible values for $\Vector \gamma$. 
This method scales exponentially with $d_{\Vector Z}$ and is potentially conservative as grid resolution grows finer. By contrast, our convex optimization approach is an efficient one-shot procedure that provides provably sharp bounds---in closed form, for the $L_2$ case. 
Their Bayesian proposal is similar to the method we compare against in Fig. \ref{fig:bayesian}. 
%We reiterate that such methods tend to struggle in partial identification tasks because every solution in the feasible region has the same likelihood.

The literature on tetrad constraints in linear SEMs goes back over a century \citep{spearman1904}, although the method was revived and refined following the publication of \citet{Spirtes2000}'s tetrad representation theorem. Our exclusion test builds on generalized results developed by numerous authors \citep{shafer_tetrads, sullivant2010, spirtes2013}, although we are to our knowledge the first to propose a Monte Carlo inference procedure for testing such claims.

Partial identification intervals for counterfactual quantities can be computed exactly for discrete variables by formulating the problem as a polynomial program \citep{zhang2022, duarte2023}. 
Though continuous data can always in principle be discretized with arbitrary precision, this quickly becomes intractable in the response function framework, as it leads to an exponential explosion of parameters. 
Some continuous alternatives have been proposed with applications to IV models \citep{kilbertus2020class, hu2021, padh_stochastic2023}. 
However, the neural architectures underlying these models can be notoriously unstable, and are ill-suited to the linear SEM setting, which is standard in much biological and econometric research.

%Our solution is similar in some respects to what \citet{Rothenhausler2021} call \emph{anchor regression}, which interpolates between OLS and 2SLS. The eponymous anchors are akin to $\bm{Z}$ in our setup, and allowed to violate not just (A3) but also (A2). 
%However, causal inference is not the focus of anchor regression, which is designed primarily for predicting outcomes with robustness guarantees. 

\section{Discussion}\label{sec:discussion}

We have limited our analysis in this paper to linear SEMs. 
% In this paper we focus our analysis on linear SEMs. 
% Though linear methods remain popular in many applications, it is well known that real-world systems often involve nonlinear dependencies between variables. 
Though such linear models remain popular in many applications, it is well known that real-world systems often involve nonlinear dependencies between variables. 
To generalize the concept of leaky instruments, we could reformulate $\tau$-exclusion to place an upper bound on the conditional mutual information: 
%Another way to state (A3) is to say that $\Vector{Z}$ and $Y$ share no mutual information after conditioning on $X$ and $\Vector{U}$, i.e. $I(\Vector{Z};Y \mid X,\Vector{U}) = 0$. 
%We could therefore place an upper bound on this quantity:
\begin{itemize}[itemindent=2em]
    \item[(A$3''$)] \emph{Generalized $\tau$-Exclusion:} $I(\Vector{Z};Y \mid X,\Vector{U}) \leq \tau$.
\end{itemize}
%to limit the direct influence of $\Vector{Z}$ on $Y$ in more general terms. 
Alternatively, (A$3''$) could place a bound on the gap between $p(y \mid x,\Vector{u})$ and $p(y \mid x,\Vector{u},\Vector{z})$ using some appropriate measure, such as the Wasserstein distance or the KL divergence. 
% (A$3''$) could be formulated as a bound on the distance between $p(y \mid x,\Vector{u})$ and $p(y \mid x,\Vector{u},\Vector{z})$ using some appropriate measure like the Kullback-Leibler divergence or an integral probability metric. 
For binary $X$, $Y$, and $Z$, \citet{ramsahai2012} and \citet{silva2016} represent this gap as a difference in expectations, $|\mathbb E[Y \mid Z = 1, do(x)] - \mathbb E[Y \mid Z = 0, do(x)]| \leq \tau_z$. 
% \cite{ramsahai2012} and \cite{silva2016} represent this as a difference in expectations $|\mathbb E[Y~|~Z = 1, do(x)] - \mathbb E[Y~|~Z = 0, do(x)]| \leq \tau_z$ for binary $X, Y$ and $Z$.
Extensions to vector-valued variants are conceptually straightforward.
Estimating these quantities is more difficult than computing linear coefficients, but could help extend our approach to a wider class of data generating processes.
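
As a brief sketch of how the information theoretic and distributional formulations relate, recall the standard definition of conditional mutual information. We assume here that the relevant densities exist, and the symbol $D_{\mathrm{KL}}$ for the KL divergence is notation introduced only for this illustration:
\begin{align*}
I(\Vector{Z};Y \mid X,\Vector{U}) &= \mathbb E\left[\log \frac{p(Y, \Vector{Z} \mid X,\Vector{U})}{p(Y \mid X,\Vector{U})\, p(\Vector{Z} \mid X,\Vector{U})}\right] \\
&= \mathbb E_{X,\Vector{U},\Vector{Z}}\Big[ D_{\mathrm{KL}}\big( p(y \mid X,\Vector{U},\Vector{Z}) \,\big\|\, p(y \mid X,\Vector{U}) \big) \Big].
\end{align*}
The second line follows because the ratio inside the logarithm equals $p(Y \mid X,\Vector{U},\Vector{Z}) / p(Y \mid X,\Vector{U})$. Bounding the conditional mutual information by $\tau$ is therefore the same as bounding the average KL divergence between the two conditional distributions, and the bound with $\tau = 0$ holds exactly when $Y$ is conditionally independent of $\Vector{Z}$ given $X$ and $\Vector{U}$.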

%Though our work is motivated by violations of the exclusion criterion (A3), our solution in fact extends to some violations of the no confounding assumption (A2). Specifically, if $\Vector Z$ and $Y$ (but not $X$) share a latent confounder, then our method will still recover sharp bounds on the ATE. This is due to the Markov equivalence of our assumed graph (Fig. \ref{fig:sem}) and an alternative in which the direct effect $\Vector \gamma$ is replaced with (or augmented by) a bidirected arc between $\Vector Z$ and $Y$. A more general way to state our model assumptions would therefore be to replace (A2) and (A3) with a bound on the conditional covariance of $\Vector Z$ and $Y$ given $X$. However, our method cannot be used to bound the ATE under arbitrary violations of (A2), since a fully confounded setup in which the latent $\Vector U$ has edges into all observable nodes would require further assumptions. This is an important topic for future work.

We have assumed in this paper that treatment effects are homogeneous throughout the population. However, a great deal of recent literature in causal machine learning has focused on \emph{heterogeneous} treatment effects, where potential outcomes are presumed to vary as a function of pre-treatment covariates \citep{Chernozhukov2018, Kunzel2019, Nie2021}. 
Some authors have brought this framework into IV models, showing that tighter ATE bounds are possible with the help of observed confounders \citep{cai2007, hartford2021, kennedy2023}. 
Future work will consider \emph{conditional} bounding methods, where extrema for $\theta$ may depend on instruments and/or other features causally antecedent to $X$.

%the KL-divergence (which would be equivalent to the information theoretic version above) or an integral probability metric. 
%Though this may make analytic solutions impossible without further assumptions, future work will consider numerical optimization strategies better suited to such generalized versions of $\tau$-exclusion.
%Another way to relax (A$3'$) would be to allow the bound to hold with some minimum probability. The resulting assumption would essentially be a PAC bound on the exclusion criterion, permitting $\tau$-violations with frequency no greater than $\delta$. The resampling procedures described above could easily be used to evaluate such claims.

% Our method allows users to specify global or local leakage thresholds. 
% However, more ambiguous alternatives are possible, such as cases where practitioners may know that some proportion of their candidate IV's are invalid without knowing which ones. 
% We will consider this possibility in future work.

A final note is that our method relies on user specification of the hyperparameter $\tau$. 
Poorly chosen thresholds can yield either overly conservative bounds (if $\tau$ is too high) or no bounds at all (if $\tau$ is below the minimum consistent with the data).
In the worst case, invalid bounds will result from selecting a threshold that is high enough to satisfy the partial identification criterion but underestimates the true value of $\lVert \Vector{\gamma} \rVert_p$. 
However, we emphasize that $\tau$-exclusion is a strictly weaker assumption than the classical exclusion criterion (A3), which is widely applied---rightly or wrongly---in IV analyses. 
% Setting $\tau=0$ may in many cases be a basically arbitrary choice, and a tempting one given the veneer of certainty provided by obtaining a point estimate for the ATE (not to mention the convenient fact that (A3) is untestable). 
In many cases, setting $\tau=0$---i.e., assuming perfectly exclusive instrumental variables---is an essentially arbitrary choice, and a tempting one given the veneer of certainty provided by obtaining a point estimate for the ATE. 
If nothing else, we hope to make practitioners think twice before falling back on this familiar default.
Background knowledge is regularly used to guide hyperparameter selection, and it is reasonable to assume that in many applications practitioners will have an \textit{a priori} sense of how much information leakage is likely between $\Vector{Z}$ and $Y$. 
In particular, our method supports a type of sensitivity analysis: we can ask how large $\tau$ may be while the bounds still exclude zero, or effects of a particular magnitude, and then ask practitioners whether larger values of $\tau$ are scientifically plausible.
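
One possible way to formalize such a sensitivity analysis is sketched below. The notation is introduced purely for illustration and is not used elsewhere in the paper: $[\theta^-(\tau), \theta^+(\tau)]$ denotes the ATE bounds obtained at threshold $\tau$, and $\tau_{\min}$ denotes the smallest threshold consistent with the data. Assuming the bounds are the extrema of $\theta$ over all models satisfying $\tau$-exclusion, enlarging $\tau$ can only enlarge the feasible set, so the intervals are nested in $\tau$. Whenever the set on the right-hand side is nonempty, we may then define a breakdown point
\begin{equation*}
\tau^\ast = \sup \left\{ \tau \geq \tau_{\min} : 0 \notin [\theta^-(\tau), \theta^+(\tau)] \right\},
\end{equation*}
which marks the largest leakage at which the sign of the ATE is still determined. A practitioner who regards values of $\lVert \Vector{\gamma} \rVert_p$ above $\tau^\ast$ as implausible can take the sign of the effect as robust to the corresponding exclusion violations. Replacing zero with another reference value gives the analogous threshold for effects of a particular magnitude.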


%Still, we argue that it is worth critically reflecting on how much direct influence one's candidate instruments may have on an outcome of interest.

%We argue that this is a fairly light requirement. 
%Background knowledge is regularly used to guide hyperparameter selection in many domains, and it is reasonable to assume that practitioners will often have an \emph{a priori} sense of how much information leakage is likely between $\bm{Z}$ and $Y$. 

%Plausible values can be estimated by a sort of hybrid 2SLS approach, in which we train a linear model for $\hat{X} = \mathbb{E}[X|\bm{Z}]$, then plug these fitted values into the regression for $\mathbb{E}[Y|\bm{Z}, \hat{X}]$ (see Eqs.~\ref{eq:scmx}, \ref{eq:scmy}). Resulting $\bm{\gamma}$ vectors provide some sense of plausible ranges for any given norm. Of course, this heuristic is unreliable since $\hat{X} \dep Y$ if $\lVert \bm{\gamma} \rVert_0 > 0$. However, it will provide better estimates than $X$ so long as (A1) and (A2) are satisfied. 

%We have not rigorously explored the sensitivity of our results to sampling variability. This and other sources of uncertainty could be jointly analyzed using nonparametric techniques like the Bayesian bootstrap \citep{rubin1981,Newton1994,Lyddon2019}, which can approximate a posterior distribution not only for the ATE itself but for the associated minima and maxima \citep{silva2016}. 

% Comment on (A2)?

\section{Conclusion}
\label{sec:conclusion}
We have presented a novel procedure for bounding causal effects in linear SEMs with unobserved confounding. 
By relaxing the exclusion criterion associated with the classical IV design, which often fails in practice, our approach extends to a wide range of problems in genetic epidemiology, econometrics, and beyond. 
We introduce the notion of \emph{leaky instruments}, which exert a limited direct effect on outcomes, and derive partial identifiability conditions for the ATE under minimal assumptions. 
Resulting bounds are sharp and practical, providing causal information in many cases where classical methods fail. 
We propose a Monte Carlo test that can falsify the exclusion criterion and a bootstrapping subroutine that guarantees asymptotic coverage at the target level. 
Future work will extend our results to multidimensional treatments, conditional bounding problems, and nonlinear systems, where alternative optimization strategies based on stochastic gradient descent may be required. 