\newpage
\appendix
\onecolumn
\section{Proofs}\label{appx:proofs}

\subsection{Proof of Lemmas \ref{lemma:theta} and \ref{lemma:gamma}}
\begin{comment}
    

We begin with the population covariance matrix $\SSigma$, from which we infer linear weights $\Vector{\beta}$ and the conditional variance of $X$ given $\Vector{Z}$, $\eta_x^2$. 
Our goal now is to define $\theta$ in terms of observable (or identifiable) quantities and just a single unknown parameter, $\rho$.

Our structural equations \ref{eq:scmx}-\ref{eq:scmSigma} imply the following:
\begin{align*}
    \SSigma_{xy} = \SSigma_{xx} \theta + \SSigma_{x\Vector{z}} \Vector{\gamma} + \rho \eta_x \eta_y.
\end{align*}
At this stage, both $\bm \gamma$ and $\eta_y$ are unknown. Start with the latter. Rearranging, we have:
\begin{align*}
    \eta_y &= \frac{\SSigma_{xy} - \SSigma_{xx} \theta - \SSigma_{x\Vector{z}} \Vector{\gamma}}{\rho \eta_x}\\
    &= \frac{\psi - \theta \eta^2_x}{\rho \eta_x}.
\end{align*}
This follows since the final term in the original numerator subtracts the $\Vector{Z}$ signal from the first two terms, which has the same effect as conditioning (recall that $\psi$ represents the conditional covariance $\text{Cov}(X, Y \mid \Vector{Z})$, while $\eta_x^2$ represents the conditional variance $\text{Var}(X \mid \Vector{Z})$). 

*** Start Lee and Gecia sequence for proof ***
\end{comment}
Consider Eqs.~\ref{eq:scmx}, \ref{eq:scmy}, and \ref{eq:scmSigma}, which define our model.
Multiplying Eq.~\ref{eq:scmx} by $\Vector{Z}$ and taking expectations gives a relationship between covariances (we apply the same two-step pattern to every product below):
\begin{align}
% \intertext{Evaluating the product of $X$ and $\Vector{Z}$ gives a relationship between covariances}
    X\Vector{Z} &= \Vector{\beta}\DotProd\Vector{Z}\Vector{Z} + \epsilon_x \Vector{Z}\nonumber\\
    \SSigma_{x\Vector{z}} &= \Vector{\beta}\DotProd\SSigma_{\Vector{z}\Vector{z}}\nonumber.%\label{eq:AppendixXZ}
\intertext{Solving for $\Vector{\beta}$ gives}
    \Vector{\beta} &= \SSigma_{x\Vector{z}} \DotProd \SSigma_{\Vector{z}\Vector{z}}^{-1}.\label{eq:AppendixBeta}
\end{align}
%
\begin{align*}
\intertext{Likewise, the product of $X$ with itself gives}
    XX &= \Vector{\beta}\DotProd\Vector{Z}X + \Vector{\beta}\DotProd\Vector{Z} \epsilon_x + \epsilon_x^2\nonumber\\
    \SSigma_{xx} &= \Vector{\beta}\DotProd\SSigma_{\Vector{z}x} + \eta_x^2.\nonumber%\label{eq:AppendixXX}
\intertext{Using \eqref{eq:AppendixBeta} and solving for $\eta_x^2$ gives}
    % \eta_x^2 &= \SSigma_{xx} - \Vector{\beta} \DotProd \SSigma_{\Vector{z}x}\nonumber\\
    \eta_x^2  &= \SSigma_{xx} - \SSigma_{x\Vector{z}} \DotProd \SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}x}\nonumber\\
    \eta_x^2  &= \KappaXX. %\label{eq:AppendixEtaXX}
\end{align*}
%
\begin{align}
\intertext{The product of $Y$ and $\Vector{Z}$ gives}
    Y\Vector{Z} &= \Vector{\gamma}\DotProd\Vector{Z}\Vector{Z} + \theta X \Vector{Z} + \epsilon_y \Vector{Z}\nonumber\\
    \SSigma_{y\Vector{z}} &= \Vector{\gamma}\DotProd\SSigma_{\Vector{z}\Vector{z}} + \theta \SSigma_{x\Vector{z}}.\nonumber
\end{align}
\begin{align}
\intertext{Solving for $\Vector{\gamma}$ gives}
    \Vector \gamma = \SSigma^{-1}_{\Vector{z}\Vector{z}} \DotProd \big(\SSigma_{\Vector{z}y} - \theta\SSigma_{\Vector{z}x} \big).\nonumber
\intertext{Taking the norm and using the definitions of $\Vector{\alpha}, \Vector{\beta}$, we recover Lemma~\ref{lemma:gamma}:}
    \boxed{\lVert \Vector{\gamma} \rVert_p = g_p(\theta) := \lVert \Vector \alpha - \theta \Vector \beta \rVert_p.} \label{eq:AppendixGamma}
\end{align}
%
\begin{align}
\intertext{The product of $Y$ and $X$ gives}
    YX &= \theta X X + \Vector{\gamma} \DotProd \Vector{Z} X + \epsilon_y X \nonumber\\
    \SSigma_{yx} &= \theta \SSigma_{xx} + \Vector{\gamma} \DotProd \SSigma_{\Vector{z}x} + \rho \eta_x \eta_y. \nonumber
% \intertext{Using \eqref{eq:AppendixGamma} and solving for $\rho\eta_x\eta_y$ gives}
% % \rho \eta_x \eta_y &= \SSigma_{yx} - \theta \SSigma_{xx} - \big( \SSigma_{y\Vector{z}} - \theta \SSigma_{x\Vector{z}} \big)\DotProd\SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}x}\nonumber\\
%     \rho \eta_x \eta_y &= \KappaXY - \theta \KappaXX \label{eq:AppendixRhoEtaXEtaY}
\intertext{Using \eqref{eq:AppendixGamma} and solving for $\rho\eta_x\eta_y$ gives}
    \rho \eta_x \eta_y &= \SSigma_{yx} - \theta \SSigma_{xx} - \big( \SSigma_{y\Vector{z}} - \theta \SSigma_{x\Vector{z}} \big)\DotProd\SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}x}\nonumber\\
    \rho \eta_x \eta_y &= \KappaXY - \theta \KappaXX. \label{eq:AppendixRhoEtaXEtaY}
% \end{align}
% \begin{align}
\intertext{Rearranging for $\eta_y^2$ gives}
    \eta_y^2 &= \frac{\big(\KappaXY - \theta \KappaXX\big)^2}{\rho^2\eta_x^2}. \label{eq:AppendixEtaY2}
\end{align}
%
\begin{align}
\intertext{The product of $Y$ with itself gives}
    YY &= \theta X Y + \Vector{\gamma} \DotProd \Vector{Z} Y + \epsilon_y Y \nonumber\\
    \SSigma_{yy} &= \theta \SSigma_{xy} + \Vector{\gamma} \DotProd \SSigma_{\Vector{z}y} + \theta\rho\eta_x\eta_y + \eta_y^2.\nonumber
\intertext{Using \eqref{eq:AppendixGamma}, \eqref{eq:AppendixRhoEtaXEtaY}, and \eqref{eq:AppendixEtaY2} gives}
% \rho \eta_x \eta_y &= \SSigma_{yx} - \theta \SSigma_{xx} - \big( \SSigma_{y\Vector{z}} - \theta \SSigma_{x\Vector{z}} \big)\DotProd\SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}x}\nonumber\\
    \SSigma_{yy} &= \theta \SSigma_{xy} + \big( \SSigma_{y\Vector{z}} - \theta \SSigma_{x\Vector{z}} \big)\DotProd\SSigma_{\Vector{z}\Vector{z}}^{-1} \DotProd \SSigma_{\Vector{z}y}\nonumber\\
    &\hphantom{=} + \theta \big(\KappaXY - \theta \KappaXX\big) + \frac{\big(\KappaXY - \theta \KappaXX\big)^2}{\rho^2\eta_x^2}.\nonumber
\end{align}
%
Combining the $\SSigma$s into $\kappa$s,
\begin{align}
    \KappaYY = \theta \KappaXY + \theta \big(\KappaXY - \theta \KappaXX\big) + \frac{\big(\KappaXY - \theta \KappaXX\big)^2}{\rho^2\eta_x^2}, \nonumber
\end{align}
collecting powers of $\theta$,
\begin{align}
    % 0 &= -\KappaYY + \frac{\KappaXY^2}{\KappaXX\rho^2} + 2 \theta \big(\KappaXY - \frac{\KappaXY}{\rho^2}\big) + \theta^2 \big(-\KappaXX + \frac{\KappaXX}{\rho^2}\big) \nonumber\\
    \KappaYY - \frac{\KappaXY^2}{\KappaXX\rho^2} = 2 \theta \KappaXY \Big(1 - \frac{1}{\rho^2}\Big) - \theta^2 \KappaXX \Big(1-\frac{1}{\rho^2}\Big), \nonumber
\end{align}
and massaging the $\rho$s and $\KappaXX$s around, we have
\begin{align}
    \frac{\KappaYY - \frac{\KappaXY^2}{\KappaXX\rho^2}}{1-\frac{1}{\rho^2}} &= 2 \theta \KappaXY - \theta^2 \KappaXX \nonumber\\
    \frac{\rho^2\KappaYY\KappaXX - \KappaXY^2}{\rho^2-1} &= 2 \theta \KappaXX \KappaXY - \theta^2 \KappaXX^2. \nonumber
    % \frac{\KappaXY^2 - \rho^2\KappaYY\KappaXX}{1 - \rho^2} &= 2 \theta \KappaXX \KappaXY - \theta^2 \KappaXX^2 \nonumber
\end{align}
Solving this quadratic for $\theta\KappaXX$ gives the solutions
\begin{align}
    \theta\KappaXX = \KappaXY \pm \sqrt{\KappaXY^2 - \frac{\KappaXY^2 - \rho^2\KappaYY\KappaXX}{1-\rho^2}}. \nonumber%\label{eq:AppendixThetaQuadratic}
\end{align}
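The radicand simplifies once both terms are placed over the common denominator $1-\rho^2$:
\begin{align*}
    \KappaXY^2 - \frac{\KappaXY^2 - \rho^2\KappaYY\KappaXX}{1-\rho^2}
    &= \frac{\KappaXY^2 (1-\rho^2) - \KappaXY^2 + \rho^2\KappaYY\KappaXX}{1-\rho^2}\\
    &= \frac{\rho^2 \big(\KappaXX\KappaYY - \KappaXY^2\big)}{1-\rho^2}.
\end{align*}
The square root therefore equals $\lvert\rho\rvert \sqrt{(\KappaXX\KappaYY - \KappaXY^2)/(1-\rho^2)}$, which is real for $\lvert\rho\rvert < 1$ because $\KappaXX\KappaYY \geq \KappaXY^2$ by the Cauchy--Schwarz inequality.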
As $\rho>0$ corresponds to the lower solution for $\theta$ (and vice versa), we have
\begin{align*}
    \theta = \frac{1}{\KappaXX} \Bigg(\KappaXY - \rho \sqrt{\frac{\KappaXX\KappaYY - \KappaXY^2}{1 - \rho^2}}\Bigg).
\end{align*}
Exploiting the identity $\tan \big( \arcsin (x) \big)  = x / \sqrt{1 - x^2}$ for $x \in (-1, 1)$, we derive the result stated in Lemma~\ref{lemma:theta}:
\begin{align*}
    \boxed{\theta = f(\rho) = \KappaXX^{-1} \Big( \KappaXY - \sqrt{\KappaXX\KappaYY - \KappaXY^2} \tan \big(\arcsin(\rho)\big) \Big).} %\label{eq:AppendixTheta}
\end{align*}
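Two quick checks of this expression may be helpful; both follow directly from the formula and assume only that $\KappaXX\KappaYY > \KappaXY^2$. First, setting $\rho = 0$ recovers the population coefficient on $X$ from regressing $Y$ on $(X, \Vector{Z})$:
\begin{align*}
    f(0) = \frac{\KappaXY}{\KappaXX}.
\end{align*}
Second, differentiating $\tan\big(\arcsin(\rho)\big) = \rho / \sqrt{1-\rho^2}$ gives
\begin{align*}
    \frac{d}{d\rho} \frac{\rho}{\sqrt{1-\rho^2}} = \frac{1}{(1-\rho^2)^{3/2}} > 0
    \quad \Longrightarrow \quad
    f'(\rho) = - \frac{\sqrt{\KappaXX\KappaYY - \KappaXY^2}}{\KappaXX \, (1-\rho^2)^{3/2}} < 0,
\end{align*}
so $f$ is strictly decreasing on $(-1, 1)$.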

% \subsection{Proof of Lemma \ref{lemma:gamma}}
% The covariance of $\Vector{Z}$ and $Y$ is due to a combination of direct effects (parameterized by $\Vector{\gamma}$) and indirect effects (parameterized by $\theta$). Specifically, our structural equations \ref{eq:scmx} and \ref{eq:scmy} imply:
% \begin{align*}
%     \SSigma_{\Vector{z}y} = \SSigma_{\Vector{z}\Vector{z}} \cdot \Vector{\gamma} + \SSigma_{\Vector{z}x} \theta
% \end{align*}
% Rearranging and solving for $\Vector{\gamma}$ gives a linear equation for leakage weights:
% \begin{align*}
%     \Vector{\gamma} = \SSigma^{-1}_{\Vector{z}\Vector{z}} \cdot (\SSigma_{\Vector{z}y} - \SSigma_{\Vector{z}x}\theta ).
% \end{align*}
% % (Using the population covariance matrix, $\SSigma_{\Vector{z}\Vector{z}}$ is guaranteed to be positive definite and therefore invertible.)
% % (Using the population covariance matrix, $\SSigma_{\Vector{z}\Vector{z}}$ is full rank and thus invertible.)
% Since we treat covariance parameters as fixed, this formula is understood as a function of the ATE, which we call $g: \mathbb{R} \mapsto \mathbb{R}^{d_Z}$.

\subsection{Proof of Lemma \ref{lemma:theta_star}}
We can think of the data covariance as constraining $\Vector{\gamma}$ to lie on the one-dimensional affine subspace $\{\Vector{\alpha} - \theta \Vector{\beta} : \theta \in \mathbb{R}\}$. (Recall that $\Vector{\alpha}, \Vector{\beta}$ are deterministic functions of $\SSigma$.) 
As $\Vector{\alpha}$ need not be a scalar multiple of $\Vector{\beta}$ in general, the resulting $L_p$ norm cannot be made arbitrarily small. 
Computing the norm-minimizing coefficient is a linear regression task: we regress $\Vector{\alpha}$ on $\Vector{\beta}$ under the $L_p$ loss. 
Call the solution to this problem $\check{\theta}_p$ (where the index indicates optimization with respect to the $L_p$ norm). 
Since partial identification is only possible if information leakage is at least the theoretical minimum consistent with the data covariance, we may define this lower bound as $\check{\tau}_p := g_p(\check{\theta}_p)$.

\subsection{Proof of Lemma \ref{lemma:rho_star}}
Recall that $f$ gives $\theta$ as a function of $\rho$, while $g_p$ gives $\lVert \Vector{\gamma} \rVert_p$ as a function of $\theta$.
We define $h_p := g_p \circ f$ as a map from the confounding coefficient $\rho$ to the information leakage $\lVert \Vector \gamma \rVert_p$.
As $h_p$ is a continuous function with a compact domain, the extreme value theorem guarantees that a minimum exists. 
Moreover, since $f$ is bijective, we know by Lemma \ref{lemma:theta_star} that our target value $\check{\rho}_p$ must be the inverse of $f$ evaluated at $\check{\theta}_p$, i.e. $\check{\rho}_p = f^{-1}(\check{\theta}_p)$. Setting $f(\rho) = \check{\theta}_p$ and solving for $\rho$, we derive the expression. 
%Note that $\frac{\rho}{1-\rho^2} = \text{tan}\big(\text{arcsin}(\rho)\big)$.  
We can now equivalently characterize the minimum leakage parameter as $\check{\tau}_p := h_p(\check{\rho}_p)$.
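
For concreteness, the inversion can be sketched as follows, where $t$ is shorthand introduced here only for illustration. Setting $f(\rho) = \check{\theta}_p$ in Lemma~\ref{lemma:theta} gives
\begin{align*}
    \tan\big(\arcsin(\rho)\big) = \frac{\KappaXY - \check{\theta}_p \KappaXX}{\sqrt{\KappaXX\KappaYY - \KappaXY^2}} =: t,
\end{align*}
and applying $\sin\big(\arctan(t)\big) = t / \sqrt{1 + t^2}$ yields
\begin{align*}
    \check{\rho}_p = \frac{t}{\sqrt{1+t^2}} = \frac{\KappaXY - \check{\theta}_p \KappaXX}{\sqrt{\KappaXX \big(\KappaYY - 2 \check{\theta}_p \KappaXY + \check{\theta}_p^2 \KappaXX\big)}}.
\end{align*}
This should agree with the expression stated in Lemma~\ref{lemma:rho_star} up to algebraic rearrangement.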

\subsection{Proof of Thm. \ref{thm:id}}
Thm. \ref{thm:id} provides identifiability criteria for the leaky IV model. In the first part of the theorem, we describe a three-partition of the threshold space in terms of the theoretical leakage minimum $\check{\tau}_p$ and the oracle value $\tau^*_p$. We claim that the identifiability and validity of ATE bounds are fully characterized by where $\tau$ falls in relation to these parameters. Next, we show that point identification is possible if and only if latent parameters align in a specific way.

Take partial identifiability criteria first. 
The task of bounding the ATE in the leaky IV model amounts to finding the min/max values of $\theta$ that satisfy:
\begin{align}\label{eq:g}
    g_p(\theta) = \tau.
\end{align}
In other words, we fit a horizontal line $\lVert \Vector \gamma \rVert_p = \tau$ across the function $g_p$ and report min/max points of intersection.
Recall that $\check{\tau}_p$ is defined as the minimum of this function. Thus when $\tau$ falls below $\check{\tau}_p$, it is clear that there is no intersection between these two curves, and Eq. \ref{eq:g} has no solution. This is our sole partial identifiability criterion: provided all our structural assumptions hold, ATE bounds are well-defined if and only if $\tau \geq \check{\tau}_p$. Below this minimum leakage point lies \textit{the infeasible region}.

But just because bounds are identifiable does not mean that they are valid. Suppose (for now) that the oracle threshold $\tau^*_p$ strictly exceeds the theoretical minimum, and recall the definition:
\begin{align*}
    \tau^*_p := \lVert \Vector \gamma^* \rVert_p = g_p(\theta^*),
\end{align*}
where $\theta^*, \Vector{\gamma}^*$ denote the true unobservable parameters.
By the convexity of the norm, the solutions to Eq. \ref{eq:g} for any $\tau \in [\check{\tau}_p, \tau^*_p)$ will fail to capture the true ATE, as the resulting horizontal line lies below the point $(\theta^*, \lVert \Vector \gamma^* \rVert_p)$. 
Even a $\SSigma$-oracle---who, recall, has access to the population covariance matrix but \textit{not} the latent parameters $\theta^*, \Vector \gamma^*$---will return invalid bounds if queried with a threshold in this half-closed interval. 
For this reason, we call this band \textit{the error region}.

With a threshold at or above the oracle value $\tau^*_p$, solutions to Eq. \ref{eq:g} are finally guaranteed to contain the true ATE $\theta^*$. This once again follows by convexity of $g_p$. Resulting bounds grow increasingly conservative with $\tau$. This \textit{valid region} for all $\tau \geq \tau^*_p$ completes our three-partition of the threshold space. 

We have thus far assumed that the oracle threshold \textit{strictly} exceeds the theoretical minimum, but these parameters may coincide. If $\tau^*_p = \check{\tau}_p$, then the error region is empty and all identifiable bounds are valid. Moreover, if $g_p$ attains a unique minimum (as it must for all strictly convex norms, i.e. $p \in (1, \infty)$), then there exists just a single solution to Eq. \ref{eq:g} for $\tau = \tau^*_p = \check{\tau}_p$. In this case, lower and upper bounds for the ATE coincide and the causal parameter is point identified as $\theta^* = \check{\theta}_p$. Note that this fortuitous circumstance occurs only on a set of Lebesgue measure zero, as it imposes a nontrivial polynomial constraint on covariance parameters \citep{okamoto1973}.

\subsection{Proof of Corollary \ref{corollary:testable}}
Under Eq. \ref{eq:scmy} and the exclusion criterion (A3), we must have that $\tau = \tau^*_p = 0$. Since $0 \leq \check{\tau}_p \leq \tau^*_p$, it follows that $\tau = \tau^*_p = \check{\tau}_p = 0$. This is a special case of the point identifiability result of Thm. \ref{thm:id}. 
Define 
\begin{align*}
    \hat{X} := \mathbb{E}[X \mid \Vector{Z}] = \Vector{\beta} \DotProd \Vector{Z},
\end{align*}
which represents the population fitted values from the first-stage OLS regression. Then the 2SLS solution can be written as the ratio:
\begin{align*}
    \theta^{\text{2SLS}} &:= \frac{\SSigma_{\hat{x}y}}{\SSigma_{\hat{x}\hat{x}}}\\
    &= \frac{\SSigma_{x\Vector{z}} \DotProd \SSigma_{\Vector{zz}}^{-1} \DotProd \SSigma_{\Vector{z}y} }{\SSigma_{x\Vector{z}} \DotProd \SSigma_{\Vector{zz}}^{-1} \DotProd \SSigma_{\Vector{z}x}}.
\end{align*}
Under (A3) we have $\Vector{\gamma}^* = \Vector{0}$, so the covariance relation $\SSigma_{\Vector{z}y} = \SSigma_{\Vector{zz}} \DotProd \Vector{\gamma} + \theta \SSigma_{\Vector{z}x}$ reduces to $\SSigma_{\Vector{z}y} = \theta^* \SSigma_{\Vector{z}x}$, and the ratio above equals $\theta^*$. 
Exploiting the definitions $\Vector \alpha := \SSigma_{\Vector{zz}}^{-1} \DotProd \SSigma_{\Vector{z}y}$ and $\Vector \beta := \SSigma_{\Vector{zz}}^{-1} \DotProd \SSigma_{\Vector{z}x}$, the same relation gives $\Vector{\alpha} = \theta^* \Vector{\beta}$, and so:
\begin{align*}
    \check{\theta}_2 &= (\Vector{\beta} \DotProd \Vector{\beta})^{-1} \Vector{\beta}\DotProd \Vector{\alpha}\\
    &= \theta^* (\Vector{\beta} \DotProd \Vector{\beta})^{-1} \Vector{\beta}\DotProd \Vector{\beta}\\
    &= \theta^* = \theta^{\text{2SLS}}.
\end{align*}
 

\subsection{Proof of Thm. \ref{thm:ate_bounds}}
We assume that partial identifiability criteria are met (see Thm. \ref{thm:id}).
Recall that $\theta$ is a bijective function of $\rho$ (see Lemma \ref{lemma:theta}). 
To compute valid, sharp ATE bounds, it is therefore sufficient to show that there exist unique minimum and maximum values of $\rho$ such that $h_p(\rho) = \tau$. 
Call these $\rho^-_{\tau, p}$ and $\rho^+_{\tau, p}$, respectively. 
(Since the function $f$ is strictly decreasing, it maps the former to $\theta^+$ and the latter to $\theta^-$.)
The advantage of working in $\rho$-space rather than $\theta$-space is that the confounding coefficient is guaranteed to lie on a compact interval that is independent of the data, namely $[-1, 1]$. 

Recall that the $L_p$ norm is convex for all $p \geq 1$ and strictly convex for $p \in (1, \infty)$. 
Also, by definition, we have that $\tau^*_p \geq \check{\tau}_p$. Thus we have four possibilities to consider, with strict and non-strict variants of both convexity and the leakage inequality (see Table \ref{tab:solutions}). 
Non-strict convexity raises complications due to the potential for plateaus in the $L_p$ norm; non-strict inequality raises complications if true and minimum leakage parameters coincide.
We will show that $\rho^-_{\tau, p}$ and $\rho^+_{\tau, p}$ are uniquely identified in all four settings.

\begin{table}\caption{Contingency table of settings to consider for Thm. \ref{thm:ate_bounds}. Note that under non-strict convexity, an interval solution for $\check{\rho}_p$ is possible but not necessary; likewise, under non-strict leakage inequality, $\tau^*_p = \check{\tau}_p$ is possible but not necessary.}\label{tab:solutions}
\begin{center}
\begin{tabular}{ccc}
    & \multicolumn{2}{c}{Convexity of $L_p$ norm} \\
    \cmidrule{2-3}
    Inequality & Strict & Non-strict \\
    \midrule
    \multicolumn{1}{c|}{Strict} & Unique $\check{\rho}_p, \tau^*_p > \check{\tau}_p$ & Interval $\check{\rho}_p, \tau^*_p > \check{\tau}_p$ \\
    \multicolumn{1}{c|}{Non-strict} & Unique $\check{\rho}_p, \tau^*_p = \check{\tau}_p$ & Interval $\check{\rho}_p, \tau^*_p = \check{\tau}_p$ \\
    \bottomrule
\end{tabular}
\end{center}
\end{table}

Start with the simplest case, in which both the convexity and inequality are strict. 
In this setting, we have exactly two solutions to the equation $h_p(\rho) = \tau$, one on either side of $\check{\rho}_p$, which is the unique minimizer of $h_p$. Thus one solution lies on the interval $[-1, \check{\rho}_p]$, and another on $[\check{\rho}_p, 1]$.\footnote{In fact, when $\tau^*_p > \check{\tau}_p$, we know that $\check{\rho}_p$ is not a viable solution, and so we can replace the closed intervals with half-open intervals $[-1, \check{\rho}_p)$ and $(\check{\rho}_p, 1]$. Since this is not the case when $\tau^*_p = \check{\tau}_p$, we stick with closed intervals throughout for greater generality.} 
This establishes the existence and uniqueness of $\rho^-_{\tau, p}$ and $\rho^+_{\tau, p}$.

Now consider the case where $h_p$ is strictly convex but the true leakage coincides with the theoretical minimum (lower left quadrant of Table \ref{tab:solutions}). In this case, we have just a single solution to the equation $h_p(\rho) = \tau$, namely $\check{\rho}_p$. This implies that $\check{\rho}_p = \rho^-_{\tau, p} = \rho^+_{\tau, p}$.

Greater care is required when $h_p$ is not strictly convex, as we can no longer assume the uniqueness of $\check{\rho}_p$ or that $h_p(\rho) = \tau$ has at most two solutions. 
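However, when no unique minimum exists for a convex function with a compact domain, the set of minimizing solutions forms a compact interval. (Existence follows from the extreme value theorem; the minimizers form an interval because the sublevel sets of a convex function are convex, and this interval is closed by continuity.) 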
Consider the setting where the leakage inequality is strict but no single value of $\rho$ minimizes $h_p$ (upper right quadrant of Table \ref{tab:solutions}). 
We can select any value from the compact interval $\check{\rho}_p$ and use this to partition $[-1, 1]$, since strict inequality guarantees that any solution must intersect with $h_p$ above its minimum.
Still, we may have uncountably many solutions to the equation $h_p(\rho) = \tau$ if $\tau$ aligns with a plateau in the norm on one or both sides of $\check{\rho}_p$. 
Convexity guarantees that we will have at most two sets of solutions, one on either side of the minimum. 
Call these intervals $\rho_0$ and $\rho_1$.
Since both are closed and bounded, each attains its minimum and maximum. 
Our target parameters are therefore identified by taking the extreme values of each, i.e. setting $\rho^-_{\tau, p} = \min \rho_0$ and $\rho^+_{\tau, p} = \max \rho_1$.

Finally, consider the case where neither the convexity of the $L_p$ norm nor the leakage inequality is strict (lower right quadrant of Table \ref{tab:solutions}). This is arguably simpler than the setting with strict inequality and non-strict convexity, since we have just a single compact interval of solutions at $\check{\rho}_p$. 
Our target parameters in this case are identified via $\rho^-_{\tau, p} = \min \check{\rho}_p$ and $\rho^+_{\tau, p} = \max \check{\rho}_p$.

\subsection{Proof of Corollary \ref{corollary:l2}}
To find ATE bounds with an $L_2$ threshold on information leakage, we invoke Lemma \ref{lemma:gamma} and find that leakage is quadratic in $\theta$:
\begin{align*}
    \lVert \Vector \gamma \rVert_2^2 = \lVert \Vector{\beta} \rVert_2^2 ~\theta^2 - 2 \Vector{\alpha} \DotProd \Vector{\beta} ~\theta + \lVert \Vector{\alpha} \rVert_2^2.
\end{align*}
We set $\tau = \lVert \Vector \gamma \rVert_2$ and solve for $\theta$ using the quadratic formula:
\begin{align*}
    \theta = \frac{2 \Vector{\alpha} \DotProd \Vector{\beta} \pm \sqrt{(2 \Vector{\alpha} \DotProd \Vector{\beta})^2 - 4 \lVert \Vector \beta \rVert_2^2 ~\big (\lVert \Vector \alpha \rVert_2^2 - \tau^2\big)}}{2 \lVert \Vector \beta \rVert_2^2}.
\end{align*}
Observe that the first summand reduces to the norm-minimizing ATE value identified in Lemma \ref{lemma:theta_star}:
\begin{align*}
    \frac{2 \Vector{\alpha} \DotProd \Vector{\beta}}{2 \lVert \Vector \beta \rVert_2^2} = (\Vector{\beta} \DotProd \Vector{\beta})^{-1} ~\Vector{\beta} \DotProd \Vector{\alpha} =: \check{\theta}_2.
\end{align*}
Some light simplification and rearrangement render the final expression:
\begin{align*}
    \boxed{\check{\theta}_2 \pm ({\Vector{\beta}\DotProd \Vector{\beta}})^{-1} \sqrt{(\Vector{\beta}\DotProd \Vector{\beta}) ~(\tau^2- \Vector{\alpha}\DotProd \Vector{\alpha}) + (\Vector{\alpha}\DotProd \Vector{\beta})^2 }}.
\end{align*}
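To connect this expression with the partial identifiability criterion of Thm. \ref{thm:id}, note that evaluating the quadratic above at $\check{\theta}_2$ gives the minimum squared leakage, and that the radicand factors accordingly:
\begin{align*}
    \check{\tau}_2^2 &= \lVert \Vector{\alpha} - \check{\theta}_2 \Vector{\beta} \rVert_2^2 = \Vector{\alpha}\DotProd \Vector{\alpha} - \frac{(\Vector{\alpha}\DotProd \Vector{\beta})^2}{\Vector{\beta}\DotProd \Vector{\beta}},\\
    (\Vector{\beta}\DotProd \Vector{\beta}) ~(\tau^2- \Vector{\alpha}\DotProd \Vector{\alpha}) + (\Vector{\alpha}\DotProd \Vector{\beta})^2 &= (\Vector{\beta}\DotProd \Vector{\beta}) ~\big(\tau^2 - \check{\tau}_2^2\big).
\end{align*}
The bounds can therefore be written equivalently as $\check{\theta}_2 \pm \sqrt{(\tau^2 - \check{\tau}_2^2) / (\Vector{\beta}\DotProd \Vector{\beta})}$, which are real if and only if $\tau \geq \check{\tau}_2$ and collapse to the single point $\check{\theta}_2$ when $\tau = \check{\tau}_2$.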

\subsection{Proof of Thm. \ref{thm:test}}
Let $\mathcal{M}$ be the space of all models satisfying our structural constraints---Eqs. \ref{eq:scmx}, \ref{eq:scmy}, \ref{eq:scmSigma} and assumptions (A1), (A2), and (A$3'_s$)---for some fixed distribution family $\mathcal{P}$ and $d_{\Vector{Z}} \geq 2$. (Recall that (A$3'_s$) is consistent with the classic exclusion criterion (A3) under $\tau=0$.)
We partition $\mathcal{M}$ into null and alternative classes $\mathcal{M}_0, \mathcal{M}_1$ depending on whether the models in each satisfy $H_0: \psi = 0$.
%\footnote{Recall the definitions $\psi := \det(\Vector \Lambda \DotProd \Vector \Lambda)$ and $\Vector \Lambda := [\Vector \alpha, \Vector \beta]$. As these latter two vectors share a common factor $\SSigma_{\Vector{zz}}^{-1}$, we could alternatively choose to define $\Vector \Lambda$ as $[\SSigma_{\Vector{z}y}, \SSigma_{\Vector{z}x}]$ with no impact on results.} 
We reiterate that this condition is necessary but not sufficient to guarantee (A3). 
Each dataset $\mathcal{D}_n$ is sampled from some fixed but unknown $P_{\SSigma}$ that belongs to either $\mathcal{M}_0$ or $\mathcal{M}_1$.

For every $P_{\SSigma} \in \mathcal{M}$, there exists some nearest null neighbor $Q^*_{\SSigma} \in \mathcal{M}_0$ (not necessarily unique) satisfying
\begin{align*}
    Q^*_{\SSigma} := \argmin_{Q_{\SSigma} \in \mathcal{M}_0} D_{KL}(P_{\SSigma} ~||~ Q_{\SSigma}).
\end{align*}
Of course, when $P_{\SSigma} \in \mathcal{M}_0$, we have $P_{\SSigma} = Q^*_{\SSigma}$ and the KL-divergence is zero.
Let $f_\psi: \mathbb{R}^{n \times (2 + d_{\Vector{Z}})} \to \mathbb{R}_{\geq 0}$ be a function from input data to corresponding test statistics $\psi$. 
(The non-negative range follows from the fact that $\Lambda \DotProd \Lambda$ is a Gram matrix and therefore positive semidefinite, so its determinant is non-negative.)
Let $\mathcal{D}_n$ be a dataset sampled from $P_{\SSigma}$, and let $G_n^{\SSigma}$ be the sampling distribution of $\psi$ at sample size $n$, i.e. $\hat{\psi}_n = f_\psi(\mathcal{D}_n) \sim G_n^{\SSigma}$ for any $\mathcal{D}_n \sim P_{\SSigma}$. 
We denote the corresponding null distribution as $G_n^{\SSigma^0}$, which represents the sampling distribution of $\psi$ under $H_0$ at sample size $n$, i.e. $\psi^0_n = f_\psi(\mathcal{D}_n^0) \sim G_n^{\SSigma^0}$, for null datasets $\mathcal{D}_n^0 \sim Q^*_{\SSigma}$. 
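
For intuition, and assuming the main-text definitions $\psi := \det(\Vector \Lambda \DotProd \Vector \Lambda)$ with $\Vector \Lambda := [\Vector \alpha, \Vector \beta]$, the test statistic can be expanded as a $2 \times 2$ Gram determinant:
\begin{align*}
    \psi = (\Vector{\alpha}\DotProd \Vector{\alpha}) ~(\Vector{\beta}\DotProd \Vector{\beta}) - (\Vector{\alpha}\DotProd \Vector{\beta})^2 \geq 0,
\end{align*}
where the inequality is Cauchy--Schwarz, with equality if and only if $\Vector{\alpha}$ and $\Vector{\beta}$ are collinear. Since $\check{\tau}_2^2 = \Vector{\alpha}\DotProd \Vector{\alpha} - (\Vector{\alpha}\DotProd \Vector{\beta})^2 / (\Vector{\beta}\DotProd \Vector{\beta})$ whenever $\Vector{\beta} \neq \Vector{0}$, this is the same as writing $\psi = (\Vector{\beta}\DotProd \Vector{\beta}) ~\check{\tau}_2^2$, so $H_0: \psi = 0$ holds exactly when the minimum $L_2$ leakage vanishes.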

To establish that $p_{\text{MC}}$ is an asymptotically valid $p$-value against $H_0$, it suffices to show that the Monte Carlo null distribution $\hat{G}_n^{\SSigma^0}$ converges to $G_n^{\SSigma^0}$.
This follows from the validity of our procedure for constructing the null covariance matrix $\SSigma^0$, which involves the minimum perturbation required to guarantee $H_0$. Specifically, we impose a linear dependence between covariance vectors $\SSigma_{\Vector{z}x}$ and $\SSigma_{\Vector{z}y}$ using the scaling factor $\hat{\theta}^{\text{2SLS}}$. 
Thus $\SSigma^0$ satisfies $\psi=0$ by construction. Moreover, since this is achieved by changing as few parameters as possible by as little as possible, there exists no nearer neighbor to $P_{\SSigma}$ within $\mathcal{M}_0$ than the resulting distribution, which therefore satisfies our definition of $Q^*_{\SSigma}$. 

We assume access to some method for sampling from $Q^*_{\SSigma}$, e.g. via $\mathcal{N}(\bm 0, \SSigma^0)$ if $\mathcal{P}$ is the family of mean-zero multivariate Gaussians. 
We draw $B$ many datasets of size $n$ from $Q^*_{\SSigma}$ and record resulting test statistics to generate the synthetic null distribution $\hat{G}_n^{\SSigma^0}$.
Convergence is assured when $p_{\text{MC}}$ is uniformly distributed under $H_0$. 
%Under the alternative hypothesis $H_1: \psi > 0$, by contrast, we expect $p_{\text{MC}}$ to concentrate nearer to zero.\footnote{Since all values of $\Lambda \DotProd \Lambda$ are necessarily positive, the minimum of the determinant is zero.} 
Let $c_\alpha(\mathcal{D}_n)$ denote the critical value at type I error rate $\alpha$ for dataset $\mathcal{D}_n$, such that, under $H_0$, the rejection region of statistics
\begin{align*}
    R_\alpha(\mathcal{D}_n) = \big\{\psi_n: \psi_n \geq c_\alpha(\mathcal{D}_n)\big\}
\end{align*}
integrates to $\alpha$. 
Rejection regions are nested for our one-sided test, i.e. $R_\alpha \subset R_{\alpha'}$ if $\alpha < \alpha'$.
Thus $c_\alpha(\mathcal{D}_n)$ represents the $1-\alpha$ quantile of the null distribution $G_n^{\SSigma^0}$. 
We reject $H_0$ if $p_{\text{MC}} \leq \alpha$, resulting in the identity:
\begin{align*}
    p_{\text{MC}} = p_{\text{MC}}(\mathcal{D}_n) = \inf \big\{\alpha: \psi_n \in R_\alpha(\mathcal{D}_n)\big\}.
\end{align*}
Then for all $\alpha \in (0,1)$, we have
\begin{align*}
    \mathbb{P}_{\mathcal{D}_n \sim Q^*_{\SSigma}}\big(\psi_n \in R_\alpha(\mathcal{D}_n)\big) = \alpha,
\end{align*}
which implies that:
\begin{align*}
    \mathbb{P}_{\mathcal{D}_n \sim Q^*_{\SSigma}}\big(p_{\text{MC}}(\mathcal{D}_n) \leq u\big) = u
\end{align*}
for all $u \in [0,1]$ \citep{Lehmann2005}. In other words, $p_{\text{MC}}$ is uniformly distributed under $H_0$, as desired, and the Monte Carlo null distribution has converged on the target $G_n^{\SSigma^0}$. We add one to the numerator and denominator as a necessary finite sample adjustment.
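
As an illustration of this adjustment, assuming the usual Monte Carlo construction (the main text gives the authoritative definition, and the superscript notation $\psi_n^{0,(b)}$ for the $b$th simulated null statistic is introduced here only for this sketch), the $p$-value takes the form
\begin{align*}
    p_{\text{MC}} = \frac{1 + \sum_{b=1}^{B} \mathbb{I}\big\{\psi_n^{0,(b)} \geq \hat{\psi}_n\big\}}{1 + B}.
\end{align*}
With $B = 999$ draws, for instance, the smallest attainable value is $1/1000$, so the test never reports a $p$-value of exactly zero.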


\subsection{Proof of Thm. \ref{thm:cvg}}
It may not be immediately clear that bootstrapping is appropriate for ATE bounds. 
After all, it is well known that the bootstrap cannot provide a valid sampling distribution for fixed order statistics such as ranks, or a target parameter that lies on the boundary of the parameter space \citep{andrews2000}.
But though we refer to our bounds as ``min'' and ``max'' solutions, that does not mean they are calculated via fixed order statistics. 
On the contrary, each represents a continuous solution to a differentiable optimization task, not the smallest or largest element in a discrete set. 
The only boundary condition for our target parameters is $\theta^- \leq \theta^+$, which automatically holds under the partial identifiability criterion $\tau \geq \check{\tau}_p$. 
As our estimator is undefined for samples that violate the criterion, our sampling distribution is always conditioned on this event. 

In general, any statistic that is a differentiable function of sample moments admits an Edgeworth expansion and can therefore have its distribution consistently estimated via bootstrap resampling \citep{hall1992, davison_hinkley}. 
Recall that our ATE bounds represent the intersection of (a) a $\tau$-feasible region that is fixed \textit{a priori}; and (b) the $L_p$ norm of $\Vector{\gamma} = \Vector{\alpha} - \theta \Vector{\beta}$, where the latter two vectors are defined as dot products of covariance parameters. 
Resulting bounds vary smoothly under resampling, since $\Vector{\alpha}$ and $\Vector{\beta}$ are differentiable with respect to $\SSigma$. 
Though more generalized relaxations of the exclusion criterion may introduce discontinuities or other issues, our formulation of $\tau$-exclusion poses no such difficulties. 
The resulting bootstrap distributions are asymptotically valid and practically useful, providing statistical inference without any parametric assumptions.

Of course, it is perfectly possible that some bootstrap samples may have to be discarded if the intersection of regions (a) and (b) is empty. 
In such cases, we simply restrict attention to those bootstraps that satisfy the partial identifiability criterion $\tau \geq \check{\tau}_p$, which should represent a non-negligible proportion of all bootstraps if the inequality is satisfied in the original dataset. 
This procedure is akin to sampling under a feasibility condition, and requires no extra steps to maintain bootstrap consistency of the kind needed in \citet{andrews2000} or \citet{ramsahai_likelihood_2011}.


\begin{comment}
    Let $Q(X_1, X_n; F)$ be some statistic estimated on $n$ samples from distribution $F \in \mathcal{P}$. We aim to estimate the distribution function:
\begin{align*}
    G_{F,n}(q) := \mathbb{P}\big(Q(X_1, X_n; F) \leq q \mid F \big)
\end{align*}
via the bootstrap estimator $G_{\hat{F},n}(q)$. 
\citet{davison_hinkley} identify three conditions that are necessary and sufficient to ensure consistency:
\begin{itemize}
    \item For any $A \in \mathcal{P}$
\end{itemize}
\end{comment}

\section{Experiments}\label{appx:exp}

\subsection{Benchmarks}\label{appx:more_exp}

We implement the backdoor adjustment via simple linear regression. Similarly, the 2SLS estimator is computed using OLS. We use the \texttt{CRAN} implementation of sisVIVE. R code for MBE was provided by the authors. 
Results of benchmark experiments for all simulation configurations are presented in Fig. \ref{fig:supplemental}.

\begin{figure}[t]
  \centering
  \includegraphics[width=0.95\columnwidth]{figures/full_benchmarks.pdf}
  \vspace{-2mm}
  \caption{Complete results for the benchmark experiment against point estimators in Sect. \ref{sec:experiments}. The top two rows use a diagonal covariance matrix; the bottom two use a Toeplitz covariance matrix. For each value of $\text{SNR}_X$, we consider three distinct values of $\text{SNR}_Y$ (simply labelled SNR within the facet grid).}
  \vspace{-2mm}
  \label{fig:supplemental}
\end{figure}

\subsection{Interpreting the scales of structural parameters}\label{appx:snr}

In our experiments, we fix the true causal effect $\theta^* = 1$ and tune the signal-to-noise ratios (SNRs) for $X$ and $Y$, denoted here as $\mathrm{SNR}_X$ and $\mathrm{SNR}_Y$ respectively. We show that these quantities, coupled with the variance $\SSigma_{yy}$, the magnitude of $\Vector{\beta}$ and the instrument covariance matrix $\SSigma_{\Vector{zz}}$, uniquely define the magnitude of the remaining structural parameters $(\Vector{\gamma}, \SSigma_{xx}, \eta_x, \eta_y)$ at a given level of confounding $\rho$, when the directions $(\Vector{\gamma}, \Vector{\beta})$ are randomized. The choice to tune the dimensionless parameters $\mathrm{SNR}_X$ and $\mathrm{SNR}_Y$ provides a more interpretable grid search than we would have if we were to vary the structural parameters directly. This is especially true given that the directions of the vector-valued structural parameters (randomized in our experiments and exponentially hard to search through in a high-$d_{\Vector{Z}}$ setting) play a crucial role in the proportion of each observable's variance explained by each causal effect.   

In our setting, $\Vector{\beta}$ and $\Vector{\gamma}$ are randomized through $\Vector{\beta} = \Vector{\Tilde{\beta}}$ and $\Vector{\gamma} = \zeta \Vector{\Tilde{\gamma}}$, where the components $\Tilde{\beta}_i$ and $\Tilde{\gamma}_i$, $i \in [d_{\Vector{Z}}]$, are drawn independently and identically (i.i.d.) from the standard normal distribution. We show that $\eta_x, \eta_y, \zeta, \SSigma_{xx}$ are uniquely determined through the equations for the variances,
\begin{equation*}
\begin{split}
    & \SSigma_{xx} = \Vector{\Tilde{\beta}} \cdot \SSigma_{\Vector{z}\Vector{z}} \cdot \Vector{\Tilde{\beta}} + \eta_x^2, \\
    & \SSigma_{yy} = \zeta^2 \Vector{\Tilde{\gamma}} \cdot \SSigma_{\Vector{z}\Vector{z}} \cdot \Vector{\Tilde{\gamma}} + \theta^2 \SSigma_{xx} + 2 \theta \zeta \Vector{\Tilde{\gamma}} \cdot \SSigma_{\Vector{zz}} \cdot \Vector{\Tilde{\beta}} + 2 \theta \eta_x \eta_y \rho + \eta_y^2,
\end{split}
\end{equation*}
and the signal-to-noise ratios, 
\begin{equation*}
\begin{split}
    & \mathrm{SNR}_X := \frac{\Vector{\beta} \cdot \SSigma_{\Vector{z}\Vector{z}} \cdot \Vector{\beta}}{\eta_x^2} = \frac{1}{\eta_x^2} \Vector{\Tilde{\beta}} \cdot \SSigma_{\Vector{z}\Vector{z}} \cdot \Vector{\Tilde{\beta}}, \\
    & \mathrm{SNR}_Y := \frac{\Vector{\gamma} \cdot \SSigma_{\Vector{z}\Vector{z}} \cdot \Vector{\gamma} + \theta^2 \SSigma_{xx} + 2 \theta \Vector{\gamma} \cdot \SSigma_{\Vector{zz}} \cdot \Vector{\beta}}{2 \theta \eta_x \eta_y \rho + \eta_y^2} = \frac{\SSigma_{yy}}{\eta_y^2 + 2 \theta \rho \eta_x \eta_y} - 1.
\end{split}
\end{equation*}
Note that ``noise'' is any contribution involving the unobserved confounding $\SSigma_{\Vector \epsilon \Vector \epsilon}$. In a certain sense, the definition of $\mathrm{SNR}_X$ is ambiguous in our setting because $\eta_x^2 = \kappa_{xx}$ can be determined from generated data. We stress our choice of definition here: if we took $\eta_x^2$ to be ``signal'', then $\mathrm{SNR}_X$ would be infinite.

We solve these four coupled quadratic equations for the remaining parameters $\eta_x, \eta_y, \zeta, \SSigma_{xx}$. They have a unique solution if we require the scaling factor $\zeta$ and the standard deviations $\eta_x, \eta_y$ to be positive. We write the solution in the following form:
\begin{align}
    & \eta_x = \sqrt{\frac{A_{xx}}{\mathrm{SNR}_X}}, \\
    & \SSigma_{xx} = A_{xx} \, \frac{\mathrm{SNR}_X + 1}{\mathrm{SNR}_X}, \\
    & \eta_y = -\theta \rho \eta_x + \sqrt{\theta^2 \rho^2 \eta_x^2 + \frac{\SSigma_{yy}}{1+\mathrm{SNR}_Y}}, \\
    & \zeta = \frac{1}{A_{yy}} \left( - \theta A_{xy} + \sqrt{\theta^2 A_{xy}^2 + A_{yy} \left( \frac{\SSigma_{yy}}{1 + \frac{1}{\mathrm{SNR}_Y}} - \theta^2 \SSigma_{xx} \right)} \right),
\end{align}
where
\begin{align*}
    & A_{xx} := \Vector{\Tilde{\beta}} \cdot \SSigma_{\Vector{zz}} \cdot \Vector{\Tilde{\beta}},\\
    & A_{xy} := \Vector{\Tilde{\beta}} \cdot \SSigma_{\Vector{zz}} \cdot \Vector{\Tilde{\gamma}},\\
    & A_{yy} := \Vector{\Tilde{\gamma}} \cdot \SSigma_{\Vector{zz}} \cdot \Vector{\Tilde{\gamma}}.
\end{align*}
Notice that the innermost bracketed term in the equation for $\zeta$ is the signal due to $\tau$-exclusion, i.e., the residual signal not solely due to $\theta$. When this term is positive, the square root dominates the term preceding it in magnitude, so $\zeta$ is greater than $0$. Notice also that these equations are more general than our particular experimental setting, since we have left $\theta = \theta^{*}$, $\SSigma_{\Vector{zz}}$, and $\SSigma_{yy}$ unspecified.

Inputting the above solutions, in order, to a data generating process allows us to tune these terms by specifying the signal-to-noise ratios. As a final note, one may be interested in studying asymptotic regimes in which the variance of $X$ or $Y$ is dominated either by signal or noise. These equations, or equations very similar to these, allow for the rigorous study of linear models in such regimes through asymptotic expansions with respect to the SNRs.
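
As a brief sketch of where the closed forms above come from, the definition of $\mathrm{SNR}_Y$ splits $\SSigma_{yy}$ into a noise part $N$ and a signal part $S$ (both symbols are introduced here for illustration only and are not used elsewhere):
\begin{align*}
    N &:= \eta_y^2 + 2 \theta \rho \eta_x \eta_y = \frac{\SSigma_{yy}}{1 + \mathrm{SNR}_Y}, &
    S &:= \SSigma_{yy} - N = \frac{\SSigma_{yy}}{1 + \frac{1}{\mathrm{SNR}_Y}}.
\end{align*}
Substituting $\Vector{\gamma} = \zeta \Vector{\Tilde{\gamma}}$ into the signal part then gives one quadratic in $\eta_y$ and one in $\zeta$,
\begin{align*}
    \eta_y^2 + 2 \theta \rho \eta_x \, \eta_y - N &= 0, \\
    A_{yy} \, \zeta^2 + 2 \theta A_{xy} \, \zeta + \theta^2 \SSigma_{xx} - S &= 0,
\end{align*}
each of which is solved with the quadratic formula, keeping the positive root. Since $N > 0$, the first quadratic has exactly one positive root; the second has exactly one positive root whenever $S > \theta^2 \SSigma_{xx}$.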

\subsection{Bayesian baseline}\label{appx:bayes}

One of the baselines we compare against is a full-likelihood Gaussian model with Bayesian posterior inference and a bounded $L_2$ norm on $\Vector \gamma$.

Assume all variables follow a joint multivariate zero-mean Gaussian distribution. Recall that $\Vector Z$ are the candidate instruments, $X$ is the treatment, and $Y$ is the outcome.

Let $\Vector \gamma$ be the coefficients of $\Vector Z$ in the equation for $Y$, and let $\tau$ be such that $||\Vector \gamma||_2 \leq \tau$. We will encode $\Vector \gamma$ as 
$$\Vector \gamma := \displaystyle \Vector{b} \times \frac{\kappa \times \tau}{||\Vector b||_2},
$$
\noindent where $\kappa \in [0, 1]$ is another (redundant) parameter, and $\Vector b$ is a free parameter vector of the same dimensionality as $\Vector \gamma$. The interpretation is that $\kappa \times \tau$ is the norm of $\Vector \gamma$, and $\Vector b$ is a direction vector. This provides a direct comparison against our constrained optimization method, as both methods can directly use information about the norm of $\Vector \gamma$; we place a uniform prior on $\kappa$ below.
  
Assuming $\Vector Z$ below is a row vector, the model is:
\[
\begin{array}{rcl}
\Vector Z & \sim & MVN(0, \SSigma_{\Vector{zz}})\\
X &=& \Vector{\beta} \DotProd \Vector{Z} + \epsilon_x\\
Y &=& \Vector{\gamma} \DotProd \Vector{Z} + \theta X + \epsilon_y\\
(\epsilon_x, \epsilon_y) &\sim & MVN(0, \SSigma_{\epsilon \epsilon}),\\
\end{array}
\]
\noindent where $\SSigma_{\Vector{zz}}$ and $\SSigma_{\epsilon \epsilon}$ are generic positive definite matrices, and $MVN$ means multivariate Gaussian distribution. The parameter set $\Theta$ is $\{\Vector \beta, \kappa, \Vector b, \theta, \SSigma_{\Vector{zz}}, \SSigma_{\epsilon \epsilon}\}$.

In what follows, we will consider only the independent candidate instruments case 
$$\SSigma_{\Vector{zz}} := 
\begin{bmatrix}
\eta_{z1}^2 & 0 & 0 &\dots & 0\\
0 & \eta_{z2}^2 & 0 & \dots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \dots & \eta_{z_{d_Z}}^2\\
\end{bmatrix},
$$
and parameterize $\SSigma_{\epsilon \epsilon}$ as
$$\SSigma_{\epsilon \epsilon} := 
\begin{bmatrix}
\eta_x^2 & \rho \eta_x \eta_y \\
\rho \eta_x \eta_y & \eta_y^2 \\
\end{bmatrix},
$$
for $\rho \in [-1, 1]$. The independence assumption on $\Vector Z$ is merely to simplify our sampler code.

Priors are defined as follows:
\[
\begin{array}{rcl}
\kappa &\sim& U(0, 1)\\
\Vector \beta &\sim& MVN(0, I \times v_\beta)\\
\Vector b &\sim& MVN(0, I \times v_b)\\
\theta &\sim& N(0, v_\theta)\\
\eta_{zi}^2 &\sim& logN(l_{\mu_z}, l_{v_z}), \text{for $i = 1, 2, \dots, d_Z$}\\
\eta_x^2 &\sim& logN(l_{\mu_x}, l_{v_x})\\
\eta_y^2 &\sim& logN(l_{\mu_y}, l_{v_y})\\
\rho & \sim& U(-1, 1)\\
\end{array}
\]
\noindent where $I$ is the identity matrix of corresponding dimensionality, $U(a, b)$ is the uniform distribution on the interval $[a, b]$, and $logN$ denotes the log-normal distribution. The remaining symbols $v_\beta, v_b, v_\theta, l_{\mu_z}, l_{v_z}, l_{\mu_x}, l_{v_x}, l_{\mu_y},$ and $l_{v_y}$ are hyperparameters.

Given a dataset $D$ with each of its $n$ rows denoting a data point, the sufficient statistic for this model is the scatter matrix $$S := D^\top \DotProd D.$$ 
The \emph{model covariance matrix} $\SSigma(\Theta)$ is given by
\[
\begin{array}{rcl}
    \SSigma(\Theta)_{\Vector{zz}} &:=& \ \SSigma_{\Vector{zz}}\\
    \SSigma(\Theta)_{\Vector{z}x} &:=& \ \SSigma_{\Vector{zz}} \DotProd \Vector \beta\\
    \SSigma(\Theta)_{xx} &:=& \Vector \beta \DotProd \SSigma_{\Vector{zz}} \DotProd \Vector \beta + \eta_x^2\\
    \SSigma(\Theta)_{xy} &:=& \SSigma(\Theta)_{\Vector{z}x} \DotProd \Vector \gamma + \SSigma(\Theta)_{xx}\times \theta + \eta_{xy}\\
    \SSigma(\Theta)_{\Vector{z}y} &:=& \SSigma_{\Vector{zz}}\DotProd \Vector \gamma + \SSigma(\Theta)_{\Vector{z}x}\times \theta\\
    \SSigma(\Theta)_{yy} &:=& 
    \Vector \gamma \DotProd \SSigma_{\Vector{zz}} \DotProd \Vector \gamma +
    2 \times \Vector \gamma \DotProd \SSigma(\Theta)_{\Vector{z}x} \times \theta +\\&&    
    \theta^2 \times \SSigma(\Theta)_{xx} + 2 \times \theta \times \eta_{xy} + \eta_y^2,\\
\end{array}
\]
\noindent where $\eta_{xy} := \rho \eta_x \eta_y$ is the off-diagonal entry of $\SSigma_{\epsilon \epsilon}$. Given that, the log-likelihood function is, up to an additive constant,
$$
L(\Theta) := -0.5 \times trace(\SSigma(\Theta)^{-1} S) - 0.5 \times n \times \log(|\SSigma(\Theta)|),
$$
\noindent where the columns/rows of $S$ are sorted in the same way as the columns/rows of $\SSigma(\Theta)$. We use a plain random walk Metropolis-Hastings method to sample from the posterior distribution over $\Theta$. We implement the uniform priors by the encoding $\kappa := \Phi(W_\kappa)$, where $\Phi(\cdot)$ is the standard Gaussian cdf and $W_\kappa$ is a standard Gaussian random variable to be sampled. Likewise, $\rho := 2 \times \Phi(W_\rho) - 1$.
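
As a rough sketch of one such update (the step size $\sigma$, the generic unconstrained parameter vector $\Vector{w}$, and the log-prior notation $\log p$ are illustrative assumptions, not details taken from the sampler code), a random walk Metropolis-Hastings step proposes
\begin{align*}
    \Vector{w}' &= \Vector{w} + \sigma \Vector{\xi}, \qquad \Vector{\xi} \sim MVN(0, I),
\intertext{and accepts the proposal with probability}
    a(\Vector{w}, \Vector{w}') &= \min\left\{1, \exp\left( L(\Theta(\Vector{w}')) + \log p(\Vector{w}') - L(\Theta(\Vector{w})) - \log p(\Vector{w}) \right) \right\},
\end{align*}
keeping $\Vector{w}$ otherwise. Because the proposal is symmetric, no proposal-density ratio appears. Under the encoding above, the uniform priors on $\kappa$ and $\rho$ enter $\log p$ only through the standard Gaussian densities of $W_\kappa$ and $W_\rho$, since $\Phi(W)$ is uniform on $[0, 1]$ when $W$ is standard Gaussian.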

\subsubsection{Usage}

Like MASSIVE \citep{bucur2020}, this is a fully Bayesian approach that returns a full posterior on $\theta$. Unlike MASSIVE, which is designed for soft sparsity constraints, our prior on $\Vector \gamma$ combines a Gaussian-distributed direction with a radius that has a uniform prior on $[0, \tau]$, so that it more closely matches the principle behind our hard-constrained optimization method.

For a comparison to take place, we translate the posterior distribution over $\theta$ into ``bounds''. More precisely, let $q_{\theta_\alpha}$ be the $\alpha$-quantile of the posterior distribution of $\theta$. A ``lower bound'' here is taken to mean the interval $(-\infty, q_{\theta_{\alpha}}]$, although it is not explicitly constructed to ``capture'' the population lower bound with probability $\alpha$. That can only happen in a heuristic sense, as the full Bayesian approach is oblivious to non-identifiability issues and has no notion of what a ``population lower bound'' should be. For joint ``capturing'' of lower and upper bounds, we suggest $(-\infty, q_{\theta_{\alpha / 2}}]$ along with $[q_{\theta_{1 - \alpha / 2}}, +\infty)$.
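
As a small illustration (the draw count $M$ and the order-statistic convention are assumptions made here for concreteness; other quantile estimators differ slightly), given $M$ posterior draws sorted as $\theta_{(1)} \leq \dots \leq \theta_{(M)}$, one may estimate
\begin{equation*}
    \hat{q}_{\theta_\alpha} = \theta_{(\lceil \alpha M \rceil)},
\end{equation*}
so that with $M = 1000$ and $\alpha = 0.05$ the reported value is the $50$th smallest draw, $\theta_{(50)}$.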

Like MASSIVE, the posterior on $\theta$ never converges to a single point in the limit of infinite data, and any entropy left in that case is a consequence of the prior. If the prior is informative, it is entirely possible for the posterior to exclude the true $\theta$ even if it perfectly fits the population distribution. If the prior is uninformative, the limiting posterior support should coincide with the feasible region obtained by plugging in the population covariance matrix, although any curvature of this posterior within its support is an artefact of the prior. However, with finite data, there is no clear way of separating entropy due to probabilistic uncertainty from entropy due to unidentifiability. Any claim that the tails of this distribution correspond to partial identifiability results is an ill-posed heuristic at best.
