Though we focus on the case where $\Vector Z$ has a direct effect on $Y$, another way to violate the exclusion criterion would be to allow for unmeasured confounding between $\Vector Z$ and $Y$. 
In the linear setting, we could parameterize this via conditional covariances, resulting in the modified assumption:
\begin{itemize}[itemindent=2em]
    %\item[(A$3''$)] \emph{Modified $\tau$-Exclusion:} $\text{Cov}(\Vector{Z},Y \mid X, \Vector{U}) \leq \tau$.
    \item[(A$3''$)] \emph{Modified $\tau$-Exclusion:} $\lVert \SSigma_{\Vector{z}y \mid x \Vector{u}} \rVert_p \leq \tau$.
\end{itemize}
Our method applies equally well to this setting, though we would need to reinterpret our results somewhat. 
Specifically, we would need to reformulate our structural equations such that $\lVert \Vector \gamma \rVert_p$ no longer quantifies a direct effect, but instead a more general measure of association between candidate instruments and the outcome. 
Future work will spell this out in more detail, as well as similar relaxations of (A2). 

Since $\psi$ is a differentiable function of sample moments, its sampling distribution can be consistently estimated via the bootstrap \citep{hall1992}.
Mean-centering guarantees that $\mathbb{E}[\psi^0_{(b)}]=0$ while preserving the distribution's higher moments, as expected under $H_0$.
This is standard procedure for one-sample bootstrap hypothesis testing (see, e.g., \citealp[Ch.~16]{efron_tibshirani_bootstrap} and \citealp[Ch.~4]{davison_hinkley}). 

We propose to test this via Monte Carlo, estimating $\check{\theta}_2$ on the original data and creating a null covariance matrix by replacing $\hat{\SSigma}_{\Vector{z}x}$ with $\hat{\SSigma}_{\Vector{z}x}^0 := \check{\theta}_2^{-1} \hat{\SSigma}_{\Vector{z}y}$. 
\begin{theorem}[\textit{Exclusion test}]\label{thm:test}
    Let $\mathcal{D}_n = \{x_i, y_i, \Vector{z}_i\}_{i=1}^n$ be a dataset generated according to the conditions of Thm. \ref{thm:id}, with $\mathcal{D}_n \sim \mathcal{N}(\bm 0, \SSigma)$ and sample estimate $\hat{\psi}_n$.
    Construct a null covariance matrix $\SSigma^0$ as detailed above. Draw $B$ synthetic datasets of size $n$, $\mathcal{D}^0_{n, (b)} \sim \mathcal{N}(\bm 0, \SSigma^0)$, and record the test statistic $\psi^0_{(b)}$ for all $b \in [B]$.
    Then as $n, B \rightarrow \infty$, the following is an asymptotically valid $p$-value against $H_0$:
    \begin{align*}
        p_{\textup{MC}} = \frac{\# \big\{b: \psi^0_{(b)} \geq \hat{\psi}_n \big\} + 1}{B+1}.
    \end{align*}
\end{theorem}


Assuming , $c_\alpha$ converges to the $1 - \alpha$ quantile of $\hat{G}_n^{\SSigma^0}$, as $n, B \rightarrow \infty$. 
Thus $\hat{G}_n^{\SSigma^0}$ is an asymptotically valid null distribution and  $p_{\text{MC}}$ satisfies
\begin{align*}
    p_{\text{MC}} = p_{\text{MC}}(\mathcal{D}_n) = \inf \{\alpha: \hat{\psi}_n \in R_\alpha\},
\end{align*}
as desired.


Let $\phi: \mathbb{R}^{n \times (2 + d_{\Vector Z})} \mapsto \{0,1\}$ be a decision procedure that takes input data and either does or does not reject $H_0$---$\phi(\mathcal{D}_n)=1$ or $\phi(\mathcal{D}_n)=0$, respectively. 
We say that $\phi$ controls the type I error at level $\alpha$ iff:
\begin{align*}
    \sup_{P_{\SSigma} \in \mathcal{M}_0} \mathbb{E}_{\mathcal{D}_n \sim P_{\SSigma}} \big[\phi(\mathcal{D}_n)\big] \leq \alpha.
\end{align*}

Let $c_\alpha(\mathcal{D}_n)$ denote the critical value at level $\alpha$ for dataset $\mathcal{D}_n$, chosen such that, under $H_0$, the rejection region
\begin{align*}
    R_\alpha := \{\psi: |\psi| \geq c_\alpha(\mathcal{D}_n)\}
\end{align*} 
has probability $\alpha$.
We assume that rejection regions are nested, i.e., $R_\alpha \subset R_{\alpha'}$ if $\alpha < \alpha'$.
For a null distribution that is symmetric about zero, $c_\alpha(\mathcal{D}_n)$ is then the $1 - \alpha/2$ quantile of the null distribution for $\mathcal{D}_n$.
The $p$-value of a test for a given dataset $\mathcal{D}_n$ is the smallest significance level at which the test statistic would fall into the corresponding rejection region:
\begin{align*}
    p(\mathcal{D}_n) := \inf \{\alpha: \hat{\psi}_n \in R_\alpha\}.
\end{align*}


A parametric procedure for testing the null hypothesis $H_0: \psi = 0$ would require a likelihood function. A more general solution is to simulate a null distribution by mean-centering the bootstrap estimates:
\begin{align*}
    \psi^0_{(b)} := \psi_{(b)} - \frac{1}{B}\sum_{b'=1}^B \psi_{(b')},
\end{align*}
and comparing these to the sample estimate $\hat{\psi}_n$.
\begin{theorem}[\textit{Exclusion test}]\label{thm:test}
    Let $\mathcal{D}_n = \{x_i, y_i, \Vector{z}_i\}_{i=1}^n$ be a dataset generated according to the conditions of Thm. \ref{thm:id}, with $d_{\Vector Z} \geq 2$. 
    Let $\hat{\psi}_n$ be the sample estimate obtained from $\mathcal{D}_n$. 
    Then as $n, B \rightarrow \infty$, the following is an asymptotically valid $p$-value against $H_0$:
    \begin{align*}
        p_{\textup{boot}} = \frac{\# \big\{b: |\psi^0_{(b)}| \geq |\hat{\psi}_n| \big\} + 1}{B+1}.
    \end{align*}
\end{theorem}
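
As a purely illustrative calculation with hypothetical numbers (these are not results from our experiments), suppose that $B = 999$ bootstrap replicates are drawn and that $24$ of the centred statistics satisfy $|\psi^0_{(b)}| \geq |\hat{\psi}_n|$. Then
\begin{align*}
    p_{\textup{boot}} = \frac{24 + 1}{999 + 1} = 0.025.
\end{align*}
The added one in the numerator and denominator ensures that $p_{\textup{boot}} \geq 1/(B+1) > 0$, so that a finite number of replicates never produces a $p$-value of exactly zero.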

Next, we find the corresponding value(s) of $\rho$.
%By Lemma \ref{lemma:theta}, we know that any $\check{\theta}_p$ must correspond to a unique value of $\rho$. Our next step is to solve for this parameter.
\begin{lemma}[\textit{Minimum leakage as a function of confounding}]
\label{lemma:rho_star} 
    Define $h := g \circ f$, such that $h: [-1, 1] \mapsto \mathbb{R}^{d_{\Vector{Z}}}$ is a function from the confounding coefficient $\rho$ to the vector of linear weights $\Vector{\gamma}$. 
    For any $p \geq 1$ and $\check{\theta}_p$ (either a unique solution or any point on a compact interval of solutions), $\lVert h(\rho) \rVert_p$ achieves its minimum at:
    \begin{align*}%\label{eq:tau_formula}
        \check{\rho}_p &:= \argmin_{\rho \in [-1, 1]} ~\lVert h(\rho) \rVert_p = f^{-1}(\check{\theta}_p)\\ 
        %&= - \text{sgn}(\check{\theta}_p) \Bigg[ \frac{(\psi^2 / \eta_x^2) - 2 \psi \check{\theta}_p + \eta_x^2 \check{\theta}_p^2}{\phi^2 - 2 \psi \check{\theta}_p + \eta^2_x \check{\theta}_p^2} \Bigg]^{1/2}.
        % &= \frac{- \text{sgn}(\check{\theta}_p) ~|\KappaXY - \eta_x^2 \check{\theta}_p|}{\sqrt{\eta_x^2 \big(\KappaYY + \check{\theta}_p (-2 \KappaXY + \eta_x^2 \check{\theta}_p) \big)}}.
        &= \sin\Bigg(\arctan\Bigg(\frac{\KappaXY - \check{\theta}_p \KappaXX}{\sqrt{\KappaXX\KappaYY - \KappaXY^2}}\Bigg)\Bigg).
    \end{align*}
\end{lemma}

By Lemma \ref{lemma:rho_star}, we know that $\lVert \bm \gamma \rVert_p$ is minimized when the confounding coefficient is $\check{\rho}_p$.
With $\tau > \check{\tau}_p$, there will be two solutions to the equation $\tau = \lVert h(\rho) \rVert_p$, one on either side of $\check{\rho}_p$. 
We therefore partition the domain of $\rho$ into two intervals at this split point (or set, for non-unique $\check{\rho}_p$). Since, by Lemma \ref{lemma:theta}, $\theta$ is a decreasing function of $\rho$, we know that $\theta^+$ will lie on the first interval and $\theta^-$ on the second. 
The loss function $R_{\tau, p}(\rho) = \big( \tau - \lVert h(\rho) \rVert_p \big)^2$ is quadratic and therefore strictly convex, guaranteeing a global optimum for each task. 
That is, minimizing $R_{\tau, p}$ on the interval $[-1, \check{\rho}_p^-)$ will recover $\theta^+$, and minimizing on the interval $(\check{\rho}_p^+, 1]$ will recover $\theta^-$.
The tightness of resulting bounds follows immediately from convexity, as the solution to each optimization problem is global. 

\begin{figure}
    \centering
    \begin{tikzpicture} [node distance=10mm,>=stealth',sh/.style={shade}]
    \node [events] (U) [sh] {$U$} ;
    \node [events, below left = of U ] (X) {$X$};
    \node [events, below right = of U ] (Y) {$Y$};
    \node [events,  below   = of X ] (z2) {$Z_{j-1}$};
    \node [events,  left = of z2 ] (z1) {$Z_1$};
    \node [events,  below  = of Y ] (z3) {$Z_j$};
    \node [events,  right = of z3 ] (z4) {$Z_{d_Z}$};
    \draw [->] (U) to [out=-150, in=60]  (X);
    \draw [->] (U) to [out=-30, in=120]  (Y);
    \draw [->] (X) to  (Y);
    \draw [->] (z1) to  (X);
    \draw [->] (z2) to  (X);
    \draw [->] (z3) to  (X);
    \draw [->] (z4) to  (X);
    \draw [dashed, ->] (z3) to (Y);
    \draw [dashed , ->] (z4) to  (Y);
    \draw [white](z2) to node[midway, black] {. . .} (z1);
    \draw [white](z3) to node[midway, black] {. . .} (z2);
    \draw [white](z4) to node[midway, black] {. . .} (z3);
    \end{tikzpicture}
    \caption{Causal diagram with treatment $X$, outcome $Y$, unobserved confounder $U$ (shaded), and candidate IVs $Z_1, \dots, Z_{d_Z}$. Dashed edges indicate possible violations of the exclusion criterion.}\label{fig:dag}\vspace{-3mm}
\end{figure}


\begin{figure}[t]
  \centering
  \includegraphics[width=0.95\columnwidth]{figures/Dz_and_tau_vs_ate.pdf}
  \vspace{-2mm}
  \caption{Bounds vary with information leakage and data dimensionality. 
  For each choice of $\tau$, we sample a population covariance matrix using the parameters listed in Appx.~\ref{appx:bound_width}.  
  Circles represent the estimates of a $\SSigma$-oracle, while curves and shading represent means and standard deviations of the LeakyIV estimator across 200 bootstraps. 
  The horizontal grey line at $\theta=1$ denotes the true ATE, 
  and the dashed vertical line at $\tau=1$ denotes the oracle threshold $\tau_2^*$.  
  \textbf{(A)} Bounds naturally increase with $\tau$, but are invalid below the oracle threshold $\tau^*_2$.
  \textbf{(B)} The variance of the bounding estimates increases with $d_{\Vector{Z}}$.}
  \vspace{-2mm}
  \label{fig:bounds}
\end{figure}




By convexity, we know that the equation has at least one solution on either side of $\check{\rho}_p$. Moreover, by 

Since $\check{\rho}_p$ may not be unique for $p \in \{1, \infty\}$, we define the interval's extrema:
\begin{align*}
    \check{\rho}_p^- := \min \check{\rho}_p, ~\quad~ \check{\rho}_p^+ := \max \check{\rho}_p,
\end{align*}
with the understanding that these two values will be identical for all $p \in (1, \infty)$. With these steps, we have all the ingredients in place to state our main result. 

\subsection{Bound Width and Uncertainty}
For a fixed dataset, bound width is an increasing function of the leakage threshold $\tau$.
Let $\theta^*$ denote the true ATE, with corresponding leakage weights $\Vector{\gamma}^* = g(\theta^*)$. 
Let $\tau^*_p := \lVert \Vector{\gamma}^* \rVert_p$ denote the ``oracle'' $\tau$-value, so named because it quantifies the precise (and unidentifiable) amount of information leakage from $\Vector{Z}$ to $Y$ in the true data generating process. 
Together with the theoretical minimum $\check{\tau}_p^*$, this value defines a three-way partition of the threshold space:
\begin{itemize}[noitemsep]
    \item $\tau \leq \check{\tau}_p^*$: \textit{the infeasible region}. No bounds result.
    \item $\tau \in (\check{\tau}_p^*, \tau^*_p)$: \textit{the error region}. Bounds are computable in this range, but guaranteed to miss the true ATE.
    \item $\tau \geq \tau^*_p$: \textit{the valid region}. Bounds are valid throughout this range, and grow increasingly conservative with $\tau$.
\end{itemize}
%Setting thresholds in between the theoretical minimum $\check{\tau}_2$ and the oracle value $\tau^*_2$ results in invalid bounds that miss the true ATE. %  *might*  miss the true ATE? 
%Above $\tau^*_2$, we find increasingly conservative bounds that are guaranteed to cover this parameter. 
These three regions are visualized in Fig. BLAH for $p=2$, where we see how bounds go from nonexistent (grey striped region) to small but erroneous (red shaded region), only becoming valid above the leakage threshold $\tau^*_2$. The quadratic curve represents the true $\theta$-$\lVert \Vector \gamma \rVert_2$ relationship, and so any bounds resulting from a horizontal line across this curve can be considered the work of a \textit{partial identification oracle} (henceforth a $\SSigma$-oracle), i.e. one with knowledge of the population covariance matrix for observed variables $\{X, Y, \Vector{Z}\}$. 
Note that even with access to the true $\SSigma$, $\tau^*_2$ and $\theta^*$ remain unidentifiable, and so we distinguish between $\SSigma$-oracles and oracles \textit{tout court}, who are additionally omniscient with respect to latent parameters and therefore able to point identify the ATE. 
The vertical blue line represents the true ATE $\theta^*$, which is guaranteed to intersect with either the lower or upper bound delivered by a $\SSigma$-oracle at $\lVert \Vector \gamma \rVert_2 = \tau^*_2$. Decreasing ATE intervals below this point miss the target $\theta^*$, while intervals above it grow increasingly conservative, as larger amounts of information leakage are consistent with a wider range of possible models. 
We observe that, unlike some other partial identifiability methods designed for IV models \citep{ramsahai2012}, bounds output by LeakyIV are not guaranteed to contain zero.

The stability of our estimator depends primarily on the model degrees of freedom. With more samples and fewer candidate IVs, covariance estimates $\hat{\SSigma}$ grow closer to the population parameters $\SSigma$; with fewer samples and more candidate IVs, estimates become noisier. 

We begin with a simple simulation experiment to evaluate the performance of our estimator. In Fig.~\ref{fig:bounds}, we plot true and estimated bounds over a range of leakage thresholds and data dimensionalities. 
Curves and shading represent means and standard deviations of the LeakyIV estimator across 200 runs. 
The horizontal line at $\theta=1$ represents the true ATE $\theta^*$.


In Fig.~\ref{fig:bounds}\textbf{(B)}, we show how bounds evolve as a function of the number of candidate instruments $d_{\Vector{Z}}$ for a fixed sample size of $2000$. 
Though the $\SSigma$-oracle bounds fluctuate within a fairly tight range, the bounds estimated by LeakyIV grow increasingly uncertain as larger values of $d_{\Vector{Z}}$ lead to noisier covariance estimates $\hat{\SSigma}$.
% With sample size fixed at $n=2000$ in this experiment, 
% larger values of $d_{\Vector{Z}}$ lead to noisier estimates of $\hat{\SSigma}$. 


Observe that by fixing the weights $\bm{\beta}, \bm{\gamma}$, we can immediately solve for the residual correlation coefficient $\rho$ via the following equations:
\begin{align*}
    \epsilon_x &= X - \bm{Z \beta}, \quad \text{Var}(\epsilon_x) = \eta_x^2\\
    \epsilon_y &= Y - \bm{Z \gamma}, \quad \text{Var}(\epsilon_y) = \eta_y^2\\
    \rho &= \frac{\text{Cov}(\epsilon_x, \epsilon_y)}{\eta_x \eta_y}.
\end{align*}



%\begin{align*}
%    \max_{j \in [d_Z]} |\beta_j - \gamma_j| > \tau.
%\end{align*}
\begin{align*}
    \lVert \Vector{\beta} - \Vector{\gamma} \rVert_p \in \big(\lVert \Vector{\beta} \rVert_p,  \tau \big].
\end{align*}

\begin{align*}
    \lVert \Vector{\beta} \rVert_p - \lVert \Vector{\gamma} \rVert_p > 0 %\in \big(\lVert \Vector{\beta} \rVert_p,  \tau \big].
\end{align*}



More importantly, we can easily put these components together into a single loss function that is convex in $\rho$ on each interval of the partition $[-1, \rho^*), (\rho^*, 1]$.

Specifically, define $\ell(\rho) := \big(\tau - \lVert h(\rho) \rVert_p\big)^2$.
We may characterize our SEM by the relationships between observed variables and learnable parameters (as opposed to Eqs. \ref{eq:scmx}, \ref{eq:scmy} where we invoke unobserved residuals $\epsilon_x, \epsilon_y$).
Incorporating these structural constraints, our optimization objective is formalized as follows:
%Assume, without loss of generality, that all features are mean-centered. Then our optimization objective can be formalized as follows:
\begin{align}
    \underset{\rho \in [-1,1]} {\text{min/max}} \quad &\theta\nonumber \quad\\
    \text{s.t.} \quad \lVert \Vector{\gamma} \rVert_p &\leq \tau\label{eq:tau_constraint} \\
    \SSigma_{\Vector{z}y} &= \SSigma_{\Vector{z}x}\theta +\SSigma_{\Vector{z}\Vector{z}} \Vector{\gamma}\label{eq:zy_constraint} \\
    \SSigma_{xy} &= \SSigma_{xx}\theta + \SSigma_{x\Vector{z}}\Vector{\gamma} + \eta_x\eta_y\rho \label{eq:xy_constraint} \\
    \SSigma_{yy} &= \theta\SSigma_{xx}\theta + \Vector{\gamma}^\top\SSigma_{\Vector{z}\Vector{z}}\Vector{\gamma}\nonumber\\
    &\quad + 2\theta(\SSigma_{x\Vector{z}}\Vector{\gamma} + \eta_x\eta_y\rho)  + \eta^2_y,  \label{eq:yvar_constraint}
\end{align}

with $\eta_x, \eta_y \geq 0$, by definition. This optimization problem is non-convex, and therefore not amenable to linear or quadratic programming techniques. However, we show that $\theta$ and $\Vector{\gamma}$ can be calculated in closed form for fixed values of $\rho$, which means that global optima can be efficiently computed via grid search.

For ease of notation, define the following three scalars:
\begin{align*}
    a &:= \SSigma_{x\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-1}\SSigma_{\Vector{z}x} - \SSigma_{xx}\\
    b &:= \SSigma_{x\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-1}\SSigma_{\Vector{z}y} - \SSigma_{xy}\\
    c &:= \SSigma_{y\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-1}\SSigma_{\Vector{z}y} - \SSigma_{yy}. 
    % a &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zx}} - \sigma^2_x\\
    % b &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}} - \sigma_{xy}\\
    % c &:= \mathbf{\Sigma}_{\mathbf{yZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}} - \sigma^2_y. 
    %a &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zx}} - \sigma^2_x \quad &d := \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-2}\mathbf{\Sigma}_{\mathbf{Zx}} \\
    %b &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}} - \sigma_{xy} \quad &e := \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-2}\mathbf{\Sigma}_{\mathbf{Zy}} \\
    %c &:= \mathbf{\Sigma}_{\mathbf{yZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}} - \sigma^2_y \quad &f := \mathbf{\Sigma}_{\mathbf{yZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-2}\mathbf{\Sigma}_{\mathbf{Zy}}
    %    a &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-2}\mathbf{\Sigma}_{\mathbf{Zx}} \quad &d :=
    %    \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zx}}\\
    %    b &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-2}\mathbf{\Sigma}_{\mathbf{Zy}} \quad &e := \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}} \\
    %    c &:= \mathbf{\Sigma}_{\mathbf{yZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-2}\mathbf{\Sigma}_{\mathbf{Zy}} \quad &f := \mathbf{\Sigma}_{\mathbf{yZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}}
\end{align*}
These values are observable and can be estimated directly from the data. 
%Additionally, it can be shown than \(a=-\eta^2_x \leq 0\). 
%\todo{\(\Sigma_{xZ}=\beta \Sigma_Z)\) \\and\\ \(\sigma^2_x=\beta^T\Sigma_Z\beta- \eta^2_x)\)}
%Moreover, lower bounds for each are well defined (see Appendix for all proofs). 
%\begin{lemma}\label{lem:lemma1}
%It can be shown that $a,d>0$ and $b,c,e,f \geq 0$.
%\end{lemma}
Our first result characterizes the relationship between $\rho$ and $\theta$, independent of any constraints on $\Vector{\gamma}$. 
\begin{theorem}[Causal effects and confounding]\label{thm:quad}
    % Define:
    % \begin{equation*}
    % %     \begin{aligned}
    % %     d &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}} \\
    % %     e &:= \mathbf{\Sigma}_{\mathbf{xZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zx}} \\
    % %     f &:= \mathbf{\Sigma}_{\mathbf{yZ}}\mathbf{\Sigma}_{\mathbf{Z}}^{-1}\mathbf{\Sigma}_{\mathbf{Zy}}
    % % \end{aligned} \quad
    % \begin{aligned}
    %     g &:= (d - \sigma^2_{\mathbf{x}})(1 + \frac{d - \sigma^2_{\mathbf{x}}}{\eta^2_x\rho^2}) \\
    %     h &:= -(e - \sigma_{\mathbf{xy}})(1 + \frac{d - \sigma^2_{\mathbf{x}}}{\eta^2_x\rho^2}) \\
    %     i &:= \frac{(e - \sigma_{\mathbf{xy}})^2}{\eta^2_x\rho^2} + f - \sigma^2_{\mathbf{y}}
    % \end{aligned}
    % \end{equation*}
    %From Eqs. \eqref{eq:xy_constraint} and \eqref{eq:yvar_constraint}, and given $\rho^2 \leq 1 $ and $ \eta_x, \eta_y \geq 0$:
    From Eqs.~\eqref{eq:zy_constraint}--\eqref{eq:yvar_constraint}, together with the constraints $\rho \in [-1, 1]$ and $\eta_x, \eta_y \geq 0$, we have that:
    % \begin{equation*}
    %     g\theta^2 + 2h\theta + i = 0,
    % \end{equation*}
    % and therefore, by the quadratic formula,
    % \begin{equation*}
    %     \theta = \frac{-h \pm \sqrt{h^2 - gi}}{g}.
    % \end{equation*}
    % \text{to be precise, }
    \[
    \theta= \frac{b}{a}- \textup{sgn}(\rho)~ \frac{\sqrt{\left(b^2-ac\right)\big(1+a/(\eta^2_x\rho^2)\big) }}{a\big(1+a/(\eta^2_x \rho^2) \big)}.
    %\xi(\rho),
%\begin{cases}
    %-\eta^{-2}_x(e-\sigma_{xy}) - \delta(\rho), & \text{if } \rho > 0\\
    %-\eta^{-2}_x(e-\sigma_{xy}) + \delta(\rho), & \text{if } \rho < 0
%    b/a - \xi(\rho), & \text{if } \rho>0\\
%    b/a + \xi(\rho), & \text{if } \rho<0
%\end{cases}
\]
    %where: 
    %\begin{align*}
    %    \xi(\rho):=\frac{\sqrt{\left(b^2-ac\right)\big(1+a/(\eta^2_x\rho^2)\big) }}{a\big(1+a/(\eta^2_x \rho^2) \big)}.
        %\delta(\rho)=\frac{\sqrt{\left(b^2-ac\right)(1+\sfrac{a}{\eta^2_x\rho^2}) }}{a(1+\sfrac{a}{\eta^2_x \rho^2})}.
    %\end{align*}
    %\begin{align*}
    %    \delta(\rho) = \frac{ \sqrt{\Big[(e-\sigma_{xy})^2 + \eta^{2}_x(f-\sigma^2_y) \Big] \Big (1+\frac{-\eta^{2}_x}{ \eta^2_x\rho^2} \Big) }}{-\eta^{2}_x \Big( 1+\frac{-\eta^{2}_x}{ \eta^2_x \rho^2} \Big)}.
    %\end{align*}
\end{theorem}
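
The following sketch indicates how an expression of this form can be obtained from the constraints. It is our reconstruction of the intermediate steps and assumes $a \neq 0$, $\rho \neq 0$, and an invertible $\SSigma_{\Vector{z}\Vector{z}}$. Solving Eq.~\eqref{eq:zy_constraint} for $\Vector{\gamma}$ and substituting the result into Eqs.~\eqref{eq:xy_constraint} and \eqref{eq:yvar_constraint} gives, in terms of the scalars $a, b, c$ defined above,
\begin{align*}
    \eta_x \eta_y \rho &= a\theta - b,\\
    \eta_y^2 &= -(a\theta^2 - 2b\theta + c).
\end{align*}
Squaring the first line, eliminating $\eta_y^2$ with the second, and using the identity $a(a\theta^2 - 2b\theta + c) = (a\theta - b)^2 - (b^2 - ac)$ yields
\begin{align*}
    (a\theta - b)^2 \Big(1 + \frac{a}{\eta_x^2 \rho^2}\Big) = b^2 - ac
    \quad \Longrightarrow \quad
    (a\theta - b)^2 = \frac{b^2 - ac}{1 + a/(\eta_x^2 \rho^2)}.
\end{align*}
Taking square roots leaves two candidate roots for $\theta$. The first line above, together with $\eta_x, \eta_y \geq 0$, forces $a\theta - b$ to share the sign of $\rho$, which selects one root on each side of $\rho = 0$.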

%In other words, $\theta$ is an inverse of quadratic function of $\rho$---
Thm.~\ref{thm:quad} states that, for a fixed covariance matrix $\SSigma$, \(\theta\) is a deterministic, bijective, decreasing function of \(\rho\), although the functional form %(choice of which root to follow) 
depends on whether $\rho$ is positive or negative (see Fig.~\ref{fig:rho_alpha}). According to \eqref{eq:xy_constraint}, \(\eta_y=\frac{a\theta-b}{\rho\eta_x}\). With $\eta_x, \eta_y \geq 0$, \(a\theta-b\) and $\rho$ must have the same sign. Given \(a=-\eta^2_x \leq 0\), it follows that \(\theta \leq \frac{b}{a}\) when $\rho >0$ and \(\theta \geq \frac{b}{a}\) when $\rho < 0$.\footnote{Also, note that when $\rho = 0$, there is no unobserved confounding and $\theta$ can simply be estimated by OLS (if $\lVert \bm{\gamma} \rVert_0 = 0$) or backdoor adjustment on $\bm{Z}$ (if $\lVert \bm{\gamma} \rVert_0 > 0$).}  The resulting relationship %describes a sigmoidal\todo{hyperbolic?} curve, as depicted in Fig.~\ref{fig:rho_alpha}. Crucially, this relationship 
is a bijection, which would allow us to identify the ATE $\theta$ if we knew $\rho$, the degree of confounding between $X$ and $Y$. %Thus, if we knew the precise degree of confounding between $\bm{X}$ and $Y$, the ATE would be immediately identified. Since $\bm{U}$ is unobserved in practice, 
Although $\rho$ cannot be estimated from data, %. However, its value is naturally bounded on $[-1,1]$. Background 
domain knowledge may help to restrict its range and thereby bound the causal effect.
%still further, for instance if we have reason to believe that $\epsilon_x$ and $\epsilon_y$ are positively correlated. 

%\todo{AM: this is not strictly correct. We are dealing with a system of equations. and $\theta$ is in the intersection of solutions of different equations in this system. The idea is that based on domain knowledge we have some ideas about counfounding effect $\theta$ and potential leak from IV to output ($\gamma$), and based on this domain knowledge we want to estimate boundaries of causal effect.} The parameter $\eta^2_x$ is not directly observed but can be estimated by linear regression (potentially penalized) of $X$ on $\bm{Z}$, since it denotes the residual variance of the model $\mathbb{E}[X|\bm{Z}] = \bm{Z}\bm{\beta}$ (see Eq.~\ref{eq:scmx}.)

\begin{figure}
  \centering
  \includegraphics[width=0.4\linewidth]{figures/rho_theta.pdf}
  \caption{Example of a $\rho$-$\theta$ curve visualizing the relationship between unobserved confounding and average causal effects. By Thm.~\ref{thm:quad}, this function takes the shape of a rotated sigmoid, with $\theta$ tracing the greater root of a quadratic formula for $\rho < 0$, and the lesser root for $\rho > 0$.}\label{fig:rho_alpha}
\end{figure}

Because strong confounding induces extreme values of $\theta$, informative bounds on the ATE can be derived by truncating the range of $\rho$. This is precisely what (A$3'$) achieves, although the exact form of this truncation depends on our choice of norm. An important observation to make at this point is that our structural assumptions entail the following identity:
\begin{equation}\label{eq:gamma}
\Vector{\gamma} = \SSigma^{-1}_{\Vector{z}\Vector{z}} (\SSigma_{\Vector{z}y} - \SSigma_{\Vector{z}x}\theta).
\end{equation}
%Since, by Thm.~\ref{thm:quad}, $\theta$ is a deterministic function of $\rho$, Eq.~\ref{eq:gamma} effectively renders $\bm{\gamma}$ a deterministic function of $\rho$ as well. 
This immediately suggests a sort of hierarchical model, in which $\rho$ fixes $\theta$ in accordance with Thm.~\ref{thm:quad}, while $\theta$  identifies $\bm{\gamma}$ in accordance with Eq.~\ref{eq:gamma}. 
%(Recall that all covariance parameters can be estimated directly from the data.) 
With this insight, we can define any Boolean condition on $\bm{\gamma}$ and bound the resulting ATE by checking which corresponding values of $\rho$ produce weights that satisfy the condition. This motivates the generic rejection sampling procedure outlined in Alg.~\ref{alg:leaky_iv}. For concreteness, we work through the details of two special cases of interest, bounding $L_2$ and $L_1$ norms of $\bm{\gamma}$.

The results of Thm.~\ref{thm:quad} allow us to rewrite the problem of bounding the causal effect in the linear setting as: 
\begin{align}
    \underset{\rho \in [-1,1]} {\text{min/max}} \quad \theta \quad \quad 
    \text{s.t.} \quad \lVert g(\theta) \rVert_p \leq \tau \quad \text{and} \quad
    \theta=f(\rho),\label{eq:theta_rho_constraint}
\end{align}
where \(g(\theta)= \SSigma^{-1}_{\Vector{z}\Vector{z}} (\SSigma_{\Vector{z}y} - \SSigma_{\Vector{z}x}\theta)\) follows from \eqref{eq:zy_constraint} and $f$ denotes the map from $\rho$ to $\theta$ given in Thm.~\ref{thm:quad}. This reformulation of the system \eqref{eq:tau_constraint}--\eqref{eq:yvar_constraint} reduces the problem of bounding the causal effect to finding the minimum and maximum of $\theta$ over its feasible set, namely the intersection of the sets defined by $\lVert g(\theta) \rVert_p \leq \tau$ and $\theta=f(\rho)$. The second set is characterized by Thm.~\ref{thm:quad}. The first, the set of $\theta$ satisfying $\lVert g(\theta) \rVert_p \leq \tau$, captures the direct effect of the leaky instruments $\Vector{Z}$ on $Y$.

\subsection{Bounding $L_2$ and $L_1$ Norms}
Intuitively... 

%The norm $\lVert \bm{\gamma} \rVert_p$ varies as a deterministic function of $\theta$ along different curves for each value of $p$. 
%Each threshold $\tau$ cuts a straight line across these curves, intersecting at two points and giving a unique range of possible values for the ATE. Call this the $(\tau, p)$-feasible interval. If resulting weights $\theta, \bm{\gamma}$ satisfy Eqs. \eqref{eq:xy_constraint} and \eqref{eq:yvar_constraint}, then our optimization problem is solved. However, this will not necessarily hold for very large or small values of $\tau$.

Bounding the $L_2$ or $L_1$ norm amounts to placing a ridge or lasso penalty on $\bm{\gamma}$, respectively, thereby generating $\theta$-$\lVert \bm{\gamma} \rVert_p$ curves with some notable geometric properties. First, consider the case where (A3$'$) holds with $p=2$. 
For ease of notation, define:
\begin{align*}
    d &:= \SSigma_{x\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-2}\SSigma_{\Vector{z}x}\\
    e &:= \SSigma_{x\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-2}\SSigma_{\Vector{z}y}\\
    f &:= \SSigma_{y\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-2}\SSigma_{\Vector{z}y}. 
\end{align*}
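These scalars can all be estimated directly from the data, and \(d=\lVert\Vector{\beta}\rVert^2_2 \geq 0\), where $\Vector{\beta} = \SSigma_{\Vector{z}\Vector{z}}^{-1}\SSigma_{\Vector{z}x}$ is the vector of coefficients from regressing $X$ on $\Vector{Z}$. We use them to characterize the $(\tau, 2)$-feasible interval for $\theta$ as follows. 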
\begin{theorem}[$(\tau, 2)$-feasible interval]\label{thm:l2feasible}
    Assume (A3$'$) holds with $p=2$. Then, by Eqs.~\eqref{eq:tau_constraint} and \eqref{eq:zy_constraint}, for all $\tau \geq f - e^2/d$,\footnote{For a valid IV, the lower bound of $\theta$ is zero.} we have:
    \begin{equation*}
        \frac{e}{d} -\chi(\tau) \leq \theta \leq \frac{e}{d} + \chi(\tau), \quad \chi(\tau) := \frac{\sqrt{d(\tau - f) + e^2}}{d} \geq 0.
    \end{equation*}
\end{theorem}
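
To see where the quadratic form of this interval comes from, the following sketch expands the squared $L_2$ norm of the weights implied by Eq.~\eqref{eq:zy_constraint}. The symbols $q_{xx}, q_{xy}, q_{yy}$ and $t$ are introduced here purely for illustration; we assume that $\SSigma_{\Vector{z}\Vector{z}}$ is invertible and that $\SSigma_{\Vector{z}x} \neq \bm{0}$. Writing $\Vector{\gamma} = \SSigma_{\Vector{z}\Vector{z}}^{-1}(\SSigma_{\Vector{z}y} - \SSigma_{\Vector{z}x}\theta)$ and using the symmetry of $\SSigma_{\Vector{z}\Vector{z}}$,
\begin{align*}
    \lVert \Vector{\gamma} \rVert_2^2 &= q_{xx}\theta^2 - 2 q_{xy}\theta + q_{yy}, \quad \text{where}\\
    q_{xx} &:= \SSigma_{x\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-2}\SSigma_{\Vector{z}x}, \quad
    q_{xy} := \SSigma_{x\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-2}\SSigma_{\Vector{z}y}, \quad
    q_{yy} := \SSigma_{y\Vector{z}}\SSigma_{\Vector{z}\Vector{z}}^{-2}\SSigma_{\Vector{z}y}.
\end{align*}
Since $q_{xx} > 0$, the inequality $\lVert \Vector{\gamma} \rVert_2^2 \leq t$ holds if and only if
\begin{align*}
    \frac{q_{xy}}{q_{xx}} - \frac{\sqrt{q_{xx}(t - q_{yy}) + q_{xy}^2}}{q_{xx}} \leq \theta \leq \frac{q_{xy}}{q_{xx}} + \frac{\sqrt{q_{xx}(t - q_{yy}) + q_{xy}^2}}{q_{xx}},
\end{align*}
which is nonempty exactly when $t \geq q_{yy} - q_{xy}^2/q_{xx}$, the value of the quadratic at its vertex $\theta = q_{xy}/q_{xx}$. This has the same form as the interval in Thm.~\ref{thm:l2feasible}, with $(q_{xx}, q_{xy}, q_{yy}, t)$ in the roles of $(d, e, f, \tau)$. Note that in this sketch $t$ bounds the squared norm.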
%Observe that $\Sigma_Z^2$ is by definitions positive semidefinite, which ensures that 
\noindent %Since $d > 0$, these bounds are always well-defined. 
$\chi(\tau)$ is an increasing function of $\tau$ (the leakage from the quasi-IVs), which implies that the feasible interval widens as $\tau$ grows.
As we can see in Fig.~\ref{fig:tau_p_feasible}, the resulting region is parabolic with a vertex at $\theta = e/d, ~\lVert \bm{\gamma} \rVert_2 = f - e^2/d$. Note that when $\lVert \bm{\gamma} \rVert_0 = 0$, $e/d$ also represents the 2SLS ATE estimate, which is the unique solution for $\theta$ in the linear IV setting under (A1)-(A3).
An immediate corollary of Thm.~\ref{thm:l2feasible} is that for $\tau > f$, $\theta=0$ is within the feasible interval. Intuitively, by increasing $\tau$, we allow a larger portion of $Y$'s variance to be explained by $\bm{Z}$, thereby decreasing our confidence in $X$'s causal influence. Beyond the $f$-threshold, the uncertainty is sufficiently large that $X$ may have a positive or negative effect on $Y$---or perhaps none at all.
\begin{theorem}[Minimum leakage for consistency with observational data]\label{thm:min_tau_r}
    Assume (A3$'$) holds with $p=2$. Then, from Thms.~\ref{thm:quad} and \ref{thm:l2feasible}, and applying the KKT conditions to Eq.~\eqref{eq:theta_rho_constraint}, we have (the second degree applies to both maximisation and minimisation targets):
    \begin{equation}
        \hat{\tau}(\rho)=d\left[\frac{b}{a}- \frac{e}{d}-\textup{sgn}(\rho)\,\xi(\rho)\right]^2-\frac{e^2}{d}+f,
    \end{equation}
    where $\hat{\tau}$ is the minimum value of $\tau$ consistent with the observational distribution and
    \begin{equation*}
        \xi(\rho) := \frac{\sqrt{\left(b^2-ac\right)\big(1+a/(\eta^2_x\rho^2)\big) }}{a\big(1+a/(\eta^2_x \rho^2) \big)}
    \end{equation*}
    is the $\rho$-dependent term of Thm.~\ref{thm:quad}. The first and second derivatives of $\hat{\tau}$ are:
    \begin{equation}\label{eq:1st_der_tau}
        \frac{\partial\hat{\tau}(\rho)}{\partial\rho}=\frac{4ad}{\eta_x^2\rho^3\big(1+\frac{a}{\eta_x^2\rho^2}\big)}\,\textup{sgn}(\rho)\,\xi(\rho)\Bigl\{\frac{b}{a}- \frac{e}{d}-\textup{sgn}(\rho)\,\xi(\rho)\Bigr\},
    \end{equation}
    \begin{equation}\label{eq:2nd_der_tau}
        \frac{\partial^2\hat{\tau}(\rho)}{\partial\rho^2}=\frac{2d}{\rho^6\big(1-\frac{1}{\rho^2}\big)^2}\, \textup{sgn}(\rho)\,\xi(\rho)\Bigl\{\textup{sgn}(\rho)\,\xi(\rho)(3\rho^2+1)-2\Bigl(\frac{b}{a}- \frac{e}{d}\Bigr)\Bigr\}.
    \end{equation}
\end{theorem}
Results:
\begin{enumerate}
    \item In general, $\lim_{\rho \to 0} \hat{\tau}(\rho)= f$. For a valid IV setting, $f=0$.  
    \item One of the solutions of
    \[\frac{\partial\hat{\tau}(\rho)}{\partial\rho}\bigg|_{\hat{\rho}_0}=0\]
    is:
    \begin{equation}\label{eq:r_1tau}
        \hat{\rho}_0=\textup{sgn}(\rho)\sqrt{\frac{a}{\eta^2_x \Bigl\{ \frac{b^2-ac}{a^2(\frac{b}{a}-\frac{e}{d})^2}-1 \Bigr\}}}.
    \end{equation}
    \item At $\theta = 0$, we have
    \begin{equation}%:eq_tauhat_t0
        \hat{\tau}(\rho)\big|_{\theta=0}=f,
    \end{equation}
    independently of $\rho$.
\end{enumerate}



SOMETHING ON TIGHTNESS OF THESE BOUNDS WHEN NORMS ARE ANALYTIC FUNCTION OF THETA? 
%\color{orange}[David, let me go through this]

\begin{figure}
\centering
\includegraphics[width=.5\textwidth]{figures/norm_curves.png}
\caption{Example norm curves as a function of $\theta$ for $p=2$ (top row) and $p=1$ (bottom row), with $d_{\Vector{Z}}=1$ (left column) and $d_{\Vector{Z}}=4$ (right column).}
\label{fig:tau_p_feasible}
\end{figure}

\subsection{Coverage}

Standard errors for covariance estimates can be computed via classical formulae, but the implications for the uncertainty of resulting ATE bounds depend on how exactly we choose to relax (A3). Though we have focused thus far on $\tau$-exclusion criteria with $L_2$ and $L_1$ norms, some applications call for subtler notions of inclusion.
For instance, instruments may fall into natural groups that violate (A3) to varying degrees, as in Mendelian randomization studies where pleiotropic effects vary across genomic regions. In such cases, it may be preferable to apply different thresholds for each $\gamma_j \in \bm{\gamma}$, or to mix norms as in elastic net regression \citep{zh_elasticnet_2005}. To accommodate more generic forms of information leakage from $\bm{Z}$ to $Y$, we propose a parametric bootstrapping procedure that simultaneously stabilizes estimates via averaging and provides standard errors for inference (see Alg.~\ref{alg:leaky_iv}). 

Briefly, the method works as follows. Let $R$ be a set of candidate $\rho$-values, e.g.\ 100 evenly spaced points along $[-1,1]$. Let $\mathcal{I}: \mathbb{R}^{d_Z} \to \{0,1\}$ be an indicator function on $\bm{\gamma}$ (potentially composed of arbitrary conjunctions, disjunctions, etc.) that evaluates to $1$ iff the leakage from $\bm{Z}$ to $Y$ is not too severe, as in our $\tau$-exclusion criterion (A3$'$). We draw some large number of independent bootstrap samples $\mathcal{D}^{(b)}, b \in \{1, \dots, B\}$. For each bootstrap sample, we compute a pair $\theta^{(b)}_\rho, \bm{\gamma}^{(b)}_\rho$ for each $\rho \in R$, and record the minimum and maximum ATE, $\theta_-^{(b)}, \theta_+^{(b)}$, over all $\rho$ such that $\mathcal{I}(\bm{\gamma}^{(b)}_\rho)=1$. We denote the mean and standard error of $\theta_*^{(b)}$ across bootstrap samples by $\hat{\theta}_*$ and $\hat{\sigma}_*$, respectively, for $* \in \{-, +\}$. This procedure provides the following coverage guarantee:
\begin{theorem}[Coverage]\label{thm:covg}
    Fix the target level $\alpha \in (0, 0.5)$. As $|R|, B \rightarrow \infty$, we have:
    \begin{align*}
        \mathbb{P}\big(\theta_* \in \hat{\theta}_* \pm \hat{\sigma}_* \times \Phi^{-1}(1 - \alpha/2)\big) \geq 1 - \alpha,
    \end{align*}
    where $\Phi^{-1}$ denotes the standard normal quantile function.
\end{theorem}
This licenses statistical inference in a straightforward manner. For instance, we reject the null hypothesis $H_0: \theta \leq 0$ iff at least $(1 - \alpha) \times 100\%$ of the sampling distribution for $\hat{\theta}_-$ is positive. For the point null $H_0: \theta = 0$, we reject iff $0 \notin [\hat{\theta}_- - \hat{\sigma}_- \times \Phi^{-1}(1 - \alpha), ~\hat{\theta}_+ + \hat{\sigma}_+ \times \Phi^{-1}(1 - \alpha)]$. These rules ensure that the false positive rate is uniformly bounded at level $\alpha$.\footnote{Though we employ frequentist methods both here and in the following experiments, we also implement a Bayesian bootstrap in the accompanying \texttt{R} package, where instead of sampling data points with replacement, prior weights are sampled from a flat Dirichlet distribution \citep{rubin1981}.} 
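
The decision rules above can be sketched in a few lines of Python. This is an illustration only, not the accompanying \texttt{R} package: the function names and inputs are hypothetical, the critical value is assumed to be the standard normal quantile, and the one-sided rule is assumed to require that at least a $(1-\alpha)$ share of bootstrap lower bounds be positive.

```python
from statistics import NormalDist, mean, stdev


def bound_interval(theta_lo, theta_hi, alpha=0.05):
    """Summarize bootstrap draws of the lower and upper ATE bounds.

    theta_lo, theta_hi: per-bootstrap minima and maxima of theta
    (hypothetical inputs; each needs at least two draws).
    Returns (lower_limit, upper_limit), each widened by a one-sided
    normal critical value at level alpha.
    """
    # Assumption: the critical value is the standard normal quantile.
    z = NormalDist().inv_cdf(1 - alpha)
    lower = mean(theta_lo) - stdev(theta_lo) * z
    upper = mean(theta_hi) + stdev(theta_hi) * z
    return lower, upper


def reject_point_null(theta_lo, theta_hi, alpha=0.05, null_value=0.0):
    """Reject H0: theta = null_value iff it lies outside the interval."""
    lower, upper = bound_interval(theta_lo, theta_hi, alpha)
    return not (lower <= null_value <= upper)


def reject_nonpositive(theta_lo, alpha=0.05):
    """Reject H0: theta <= 0 iff enough lower-bound draws are positive."""
    share_positive = sum(t > 0 for t in theta_lo) / len(theta_lo)
    # Assumption: "enough" means a share of at least 1 - alpha.
    return share_positive >= 1 - alpha
```

Here the sample standard deviation of the bootstrap draws plays the role of $\hat{\sigma}_*$, and their mean plays the role of $\hat{\theta}_*$.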


%\begin{align*}
%    \sup_{\theta_* \in \Theta_0} \mathbb{P}_{\Theta_0}(\hat{\theta}_* \in \Theta_1) \leq \alpha,
%\end{align*}
%where $\Theta_0, \Theta_1$ denote a partition of the parameter space into null and alternative regions, respectively.

%Though this exposition has been entirely frequentist, we can easily replace the classical bootstrap with a Bayesian variant \citep{rubin1981}. In this case, rather than sampling datapoints, we sample prior weights from a flat Dirichlet distribution and use these to compute weighted estimates of subsequent parameters. This results in a Bayesian credible interval with similar inferential properties to its frequentist counterpart. While we use the classic bootstrap in our experiments, both methods are implemented in the accompanying \texttt{R} package.


%By solving Thm.~\ref{thm:quad} across the full range of $\rho$, which is naturally bounded on $[-1, 1]$, we describe a parabolic function. When the codomain of this function overlaps with the $\tau$-feasible region defined in Thm.~\ref{thm:feasible}, the geometric result is the intersection of two parabaloids, a surface that may be smooth or discontinuous depending on the covariance structure of the data (see Fig. BLAH). We then solve the optimization problem in Eq.~\ref{eq:opt} by grid search over $\tau$ and $\rho$, as described in Alg.~\ref{alg:soft_iv}.



\begin{algorithm}
\caption{$\texttt{leakyIV}$}\label{alg:leaky_iv}
\small
\textbf{Input:} Data $\mathcal{D} = \{\bm{z}_i, x_i, y_i\}_{i=1}^n$; candidate values $R$; indicator $\mathcal{I}$; number of bootstraps $B$

\begin{algorithmic}[1]
\STATE Initialize: $\texttt{theta\_lo} \gets [~], ~\texttt{theta\_hi} \gets [~]$
\FOR{$b \in \{1, \dots, B\}$}
    \STATE Draw bootstrap sample $\mathcal{D}^{(b)}$
    \STATE Estimate covariance matrix $\mathbf{\hat{\Sigma}}^{(b)}$
    \STATE \texttt{theta\_lo}[$b$] $\gets +\infty$, ~\texttt{theta\_hi}[$b$] $\gets -\infty$
    \FOR {$\rho \in R$}
        \STATE Compute $\theta^{(b)}_\rho$ from Thm.~\ref{thm:quad}
        \STATE Compute $\bm{\gamma}^{(b)}_\rho$ from Eq.~\ref{eq:gamma}
        \IF {$\mathcal{I}(\bm{\gamma}^{(b)}_\rho)=1$}
            \STATE \texttt{theta\_lo}[$b$] $\gets \min\{\texttt{theta\_lo}[b], ~\theta^{(b)}_\rho\}$
            \STATE \texttt{theta\_hi}[$b$] $\gets \max\{\texttt{theta\_hi}[b], ~\theta^{(b)}_\rho\}$
        \ENDIF
    \ENDFOR
\ENDFOR
\STATE $\hat{\theta}_- \gets$ \texttt{mean}(\texttt{theta\_lo})
\STATE $\hat{\sigma}_- \gets$ \texttt{sd}(\texttt{theta\_lo})
\STATE $\hat{\theta}_+ \gets$ \texttt{mean}(\texttt{theta\_hi})
\STATE $\hat{\sigma}_+ \gets$ \texttt{sd}(\texttt{theta\_hi})
\end{algorithmic}
\textbf{Output:} $\hat{\theta}_-, \hat{\sigma}_-, \hat{\theta}_+, \hat{\sigma}_+$
\end{algorithm}
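
The control flow of Alg.~\ref{alg:leaky_iv} can be sketched in Python as follows. This is a minimal illustration rather than the implementation in the accompanying \texttt{R} package. The callables \texttt{solve} and \texttt{feasible} are hypothetical stand-ins for the computations of $\theta^{(b)}_\rho$, $\bm{\gamma}^{(b)}_\rho$ and the indicator $\mathcal{I}$. Skipping bootstrap replicates with no feasible $\rho$ is an assumption of this sketch, not something the algorithm specifies.

```python
import math
import random
from statistics import mean, stdev


def leaky_iv_bootstrap(data, rho_grid, solve, feasible, n_boot=200, seed=0):
    """Bootstrap the minimum and maximum feasible ATE.

    data:     list of observations (any record type).
    rho_grid: sequence of candidate confounding values (the set R).
    solve:    hypothetical callable (sample, rho) -> (theta, gamma).
    feasible: hypothetical callable gamma -> bool (the indicator I).
    Needs at least two replicates with a feasible rho, else stdev raises.
    """
    rng = random.Random(seed)
    theta_lo, theta_hi = [], []
    for _ in range(n_boot):
        # Resample observations with replacement.
        sample = rng.choices(data, k=len(data))
        lo, hi = math.inf, -math.inf
        for rho in rho_grid:
            theta, gamma = solve(sample, rho)
            if feasible(gamma):
                # Update both extremes independently for every feasible rho.
                lo = min(lo, theta)
                hi = max(hi, theta)
        # Assumption: replicates with no feasible rho are skipped.
        if lo <= hi:
            theta_lo.append(lo)
            theta_hi.append(hi)
    return {
        "theta_lo": mean(theta_lo), "se_lo": stdev(theta_lo),
        "theta_hi": mean(theta_hi), "se_hi": stdev(theta_hi),
    }
```

Initializing the running extremes at $\pm\infty$ and updating both on every feasible $\rho$ means that a single feasible value sets the lower and upper bound at once.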


Full details of our simulation setup are described in Appx.~\ref{appx:exp}. 
Briefly, we vary the following parameters:
\begin{itemize}[noitemsep]
    %\item the covariance matrix $\SSigma_{\Vector{z}\Vector{z}}$ is either diagonal or Toeplitz;
    %\item the proportion of valid instruments (i.e., non-leaky) ranges from $0$ to $\sfrac{1}{2}$; 
    \item the SNR for Eqs.~\ref{eq:scmx} and \ref{eq:scmy} ranges from $\sfrac{1}{3}$ to $3$;
    \item the level of unobserved confounding $\rho$ ranges from $\sfrac{1}{4}$ to $\sfrac{3}{4}$, with random sign; 
    \item the ATE $\theta$ has random sign and explains anywhere from $\sfrac{1}{3}$ to $\sfrac{2}{3}$ of the signal variance for $Y$. 
\end{itemize}
For each setting, we draw a population of $10^4$ samples and run each method on $10$ random subsets of $n = 1000$. 


\begin{figure}
    \centering
    \begin{tikzpicture} [node distance=10mm,>=stealth',sh/.style={shade}]
    \node [events] (U) [sh] {$U$} ;
    \node [events, below left = of U ] (X) {$X$};
    \node [events, below right = of U ] (Y) {$Y$};
    \node [events,  below   = of X ] (z2) {$Z_{j-1}$};
    \node [events,  left = of z2 ] (z1) {$Z_1$};
    \node [events,  below  = of Y ] (z3) {$Z_j$};
    \node [events,  right = of z3 ] (z4) {$Z_{d_Z}$};
    \draw [->] (U) to [out=-150, in=60]  (X);
    \draw [->] (U) to [out=-30, in=120]  (Y);
    \draw [->] (X) to  (Y);
    \draw [->] (z1) to  (X);
    \draw [->] (z2) to  (X);
    \draw [->] (z3) to  (X);
    \draw [->] (z4) to  (X);
    \draw [dashed, ->] (z3) to (Y);
    \draw [dashed , ->] (z4) to  (Y);
    \draw [white](z2) to node[midway, black] {. . .} (z1);
    \draw [white](z3) to node[midway, black] {. . .} (z2);
    \draw [white](z4) to node[midway, black] {. . .} (z3);
    \end{tikzpicture}
    \caption{Causal diagram with treatment $X$, outcome $Y$, unobserved confounder $U$ (shaded), and candidate IVs $Z_1, \dots, Z_{d_{Z}}$. Dashed edges suggest possible violations of the exclusion criterion.}\label{fig:dag}\vspace{-3mm}
\end{figure}

\subsection{Bound width and uncertainty}\label{appx:bound_width}

For this section, we simulate candidate instruments $\bm Z \sim \mathcal{N}(\bm 0, \SSigma_{\bm{zz}})$ with diagonal covariance matrix and variance fixed at $1 / d_{\Vector Z}$ for each $Z_j$. We fix $\theta = 1$ and simulate linear weights $\bm \beta$ from a standard normal distribution. Our leakage weights $\bm \gamma$ are selected to ensure a signal-to-noise ratio (SNR) of 2 in the structural equation for $Y$ (for more on SNR calculations, see Appx.~\ref{appx:snr}). With this population covariance matrix fixed, ATE bounds are now a deterministic function of $\tau$. 

For Fig.~\ref{fig:bounds}(A), we fix $d_{\Vector Z} = 5$ and vary $\tau$ by running through a range of inflation factors $\lambda \in [0.2, 2]$, where $\lambda = \tau / \lVert \bm \gamma \rVert_2$. At each value, we draw a new population covariance matrix according to the model above to compute $\SSigma$-oracle bounds. We then apply our LeakyIV estimator on 200 separate draws of $n=1000$ from this data-generating process. For Fig.~\ref{fig:bounds}(B), we repeat the procedure with fixed $\lambda = 1.5$ and vary $d_{\Vector Z}$ from 2 to 20. We use an empirical Bayes shrinkage estimator to ensure a positive definite $\hat{\SSigma}$ \citep{Schafer2005}.