\documentclass{article}

\title{Simple sampler as a baseline model}

\begin{document}

\maketitle

\section{Model}

Let's compare our approach against a full-likelihood Gaussian model with Bayesian posteriors.

Assume everything is Gaussian and zero mean. 
In what follows, $Z$ are candidate instruments, $X$ is the treatment, $Y$ is the outcome. $MVN$ means multivariate Gaussian. 

If $\gamma$ are the coefficients of $Z$ in the equation for $Y$, and $\tau$ is such that $||\gamma||_2^2 \leq \tau$ (just bothering with $p = 2$ here), let's encode $\gamma$ as 
$$\gamma := \displaystyle b \times \sqrt{\frac{\kappa \times \tau}{||b||_2^2}},
$$
\noindent where $\kappa \in [0, 1]$ is another (redundant) parameter, and $b$ is the free parameter vector of the same dimensionality as $\gamma$.
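
\noindent As a quick check that this encoding respects the constraint (a worked step added here, using only the definitions above and assuming $b \neq 0$):
\[
||\gamma||_2^2 = ||b||_2^2 \times \frac{\kappa \times \tau}{||b||_2^2} = \kappa \times \tau \leq \tau,
\]
\noindent so $\kappa$ sets the fraction of the budget $\tau$ that is used, while $b$ only determines the direction of $\gamma$.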
  
Assuming $Z$ below is a column vector, the model is:
\[
\begin{array}{rcl}
Z & \sim & MVN(0, \Sigma_{zz})\\
X &=& Z^{\mathsf T}\beta + \epsilon_x\\
Y &=& Z^{\mathsf T}\gamma + X\theta + \epsilon_y\\
(\epsilon_x, \epsilon_y) &\sim & MVN(0, \Sigma_{\epsilon \epsilon}),\\
\end{array}
\]
\noindent where $\Sigma_{zz}$ and $\Sigma_{\epsilon \epsilon}$ are generic positive definite matrices. The parameter set $\Theta$ is $\{\beta, \kappa, b, \theta, \Sigma_{zz}, \Sigma_{\epsilon \epsilon}\}$.
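
\noindent To make the generative story concrete, here is a minimal Python sketch that draws $n$ rows from the model above. The function name \texttt{simulate}, its argument names and the use of NumPy are illustrative assumptions, not part of the original note; $\gamma$ is passed in directly rather than built from $(b, \kappa, \tau)$.
\begin{verbatim}
import numpy as np

def simulate(n, beta, gamma, theta, Sigma_zz, Sigma_ee, rng):
    """Draw n rows (Z, X, Y) from the linear Gaussian model."""
    p = len(beta)
    # Instruments: one row per data point (n x p).
    Z = rng.multivariate_normal(np.zeros(p), Sigma_zz, size=n)
    # Correlated error terms (eps_x, eps_y), one pair per row (n x 2).
    eps = rng.multivariate_normal(np.zeros(2), Sigma_ee, size=n)
    X = Z @ beta + eps[:, 0]
    Y = Z @ gamma + X * theta + eps[:, 1]
    # Columns are ordered (Z, X, Y).
    return np.column_stack([Z, X, Y])
\end{verbatim}
\noindent A call such as \texttt{simulate(100, beta, gamma, theta, Sigma\_zz, Sigma\_ee, np.random.default\_rng(0))} returns an $n \times (p + 2)$ array that can play the role of a dataset $D$.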

Priors are defined as follows. $I$ means identity matrix and anything else undefined so far is a hyperparameter. Distribution $U(0, 1)$ is the uniform distribution in the unit interval. $invW$ is the inverse Wishart.
\[
\begin{array}{rcl}
\kappa &\sim& U(0, 1)\\
\beta &\sim& MVN(0, I \times v_\beta)\\
b &\sim& MVN(0, I \times v_b)\\
\theta &\sim& N(0, v_\theta)\\
\Sigma_{zz} &\sim& invW(\delta_{zz}, I)\\
\Sigma_{\epsilon \epsilon} &\sim& invW(\delta_{\epsilon \epsilon}, I)\\
\end{array}
\]

Given a dataset $D$ with each of its $n$ rows denoting a data point, the sufficient statistic for this model is $$S := D^{\mathsf T}D.$$ For ease of notation, let's define
\[
\begin{array}{rcl}
    \eta_x^2 &:=& [\Sigma_{\epsilon \epsilon}]_{11}\\
    \eta_{xy} &:=& [\Sigma_{\epsilon \epsilon}]_{12}\\
    \eta_y^2 &:=& [\Sigma_{\epsilon \epsilon}]_{22},\\
\end{array}
\]
\noindent where $\eta_{xy}$ is what we call $\rho\eta_x\eta_y$ in the paper; there isn't much of a reason to focus on the correlation coefficient here.

The \emph{model covariance matrix} $\Sigma(\Theta)$ is given by
\[
\begin{array}{rcl}
    \Sigma(\Theta)_{zz} &:=& \Sigma_{zz}\\
    \Sigma(\Theta)_{zx} &:=& \Sigma_{zz}\beta\\
    \Sigma(\Theta)_{xx} &:=& \beta^{\mathsf T}\Sigma_{zz}\beta + \eta_x^2\\
    \Sigma(\Theta)_{xy} &:=& \Sigma(\Theta)_{zx}^{\mathsf T}\gamma + \theta \times \Sigma(\Theta)_{xx} + \eta_{xy}\\
    \Sigma(\Theta)_{zy} &:=& \Sigma_{zz}\gamma + \Sigma(\Theta)_{zx}\times \theta\\
    \Sigma(\Theta)_{yy} &:=& 
    \gamma^{\mathsf T}\Sigma_{zz}\gamma +
    2 \times \gamma^{\mathsf T}\Sigma(\Theta)_{zx} \times \theta +\\&&    
    \theta^2 \times \Sigma(\Theta)_{xx} + 2 \times \theta \times \eta_{xy} + \eta_y^2.\\
\end{array}
\]
With that, the log-likelihood function is, up to an additive constant,
$$
L(\Theta) := -0.5 \times \mathrm{trace}(\Sigma(\Theta)^{-1} S) - 0.5 \times n \times \log(|\Sigma(\Theta)|),
$$
\noindent where the columns/rows of $S$ are sorted in the same way as the columns/rows of $\Sigma(\Theta)$.
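
\noindent The following is a minimal Python sketch of a straight coding of $\Sigma(\Theta)$ and $L(\Theta)$. Each block is computed as the covariance implied by the structural equations for $X$ and $Y$, with rows and columns ordered $(Z, X, Y)$. The function names \texttt{model\_cov} and \texttt{log\_lik} and the choice of NumPy are assumptions made for illustration; $\gamma$ is taken as already built from $(b, \kappa, \tau)$.
\begin{verbatim}
import numpy as np

def model_cov(beta, gamma, theta, Sigma_zz, Sigma_ee):
    """Assemble Sigma(Theta), rows/columns ordered (Z, X, Y)."""
    p = len(beta)
    eta_x2, eta_xy, eta_y2 = Sigma_ee[0, 0], Sigma_ee[0, 1], Sigma_ee[1, 1]
    s_zx = Sigma_zz @ beta                  # Cov(Z, X)
    s_xx = beta @ s_zx + eta_x2             # Var(X)
    s_zy = Sigma_zz @ gamma + s_zx * theta  # Cov(Z, Y)
    # Cov(X, Y) = Cov(X, Z) gamma + theta Var(X) + eta_xy
    s_xy = s_zx @ gamma + theta * s_xx + eta_xy
    s_yy = (gamma @ Sigma_zz @ gamma + 2 * theta * (gamma @ s_zx)
            + theta**2 * s_xx + 2 * theta * eta_xy + eta_y2)
    Sigma = np.empty((p + 2, p + 2))
    Sigma[:p, :p] = Sigma_zz
    Sigma[:p, p] = Sigma[p, :p] = s_zx
    Sigma[:p, p + 1] = Sigma[p + 1, :p] = s_zy
    Sigma[p, p] = s_xx
    Sigma[p, p + 1] = Sigma[p + 1, p] = s_xy
    Sigma[p + 1, p + 1] = s_yy
    return Sigma

def log_lik(Sigma, S, n):
    """-0.5 trace(Sigma^{-1} S) - 0.5 n log|Sigma|, constant dropped."""
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return -np.inf  # not positive definite: reject
    # solve(Sigma, S) avoids forming the explicit inverse.
    return -0.5 * np.trace(np.linalg.solve(Sigma, S)) - 0.5 * n * logdet
\end{verbatim}
\noindent Adding the log-priors to \texttt{log\_lik} gives an unnormalized log-posterior that can be handed to any sampler accepting a plain function.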

Log-likelihood + log-priors gives the model function that a package like Stan should be able to handle. The pain is encoding the above in Stan's own silly language instead of just using the damn language you are already using anyway (I hate hate hate hate probabilistic programming). Hopefully you can find a package that can just take a function handle to a straight Python or R coding of the above, and call a black-box sampler on it. 

Given that $\Sigma_{zz}$ and $\Sigma_{\epsilon \epsilon}$ must be positive definite, we may also need to further break them down, e.g.
$$\Sigma_{zz} := C_{zz}C_{zz}^{\mathsf T},$$
\noindent where $C_{zz}$ is a lower-triangular matrix, and its prior can be a bunch of independent univariate Gaussians. Likewise, $\kappa$ may need to be parameterized as
$$\kappa := \Phi(W),$$
\noindent where $\Phi(\cdot)$ is the standard Gaussian cdf and $W$ is a standard Gaussian random variable.
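
\noindent A minimal Python sketch of this unconstrained parameterization follows. The helper names (\texttt{chol\_from\_vector}, \texttt{std\_normal\_cdf}, \texttt{constrain}) and the row-by-row ordering of the free values in $C_{zz}$ are assumptions made for illustration. Note that $C_{zz}C_{zz}^{\mathsf T}$ is positive definite only when every diagonal entry of $C_{zz}$ is nonzero, which holds with probability one under continuous priors.
\begin{verbatim}
import math
import numpy as np

def chol_from_vector(c, d):
    """Fill a d x d lower-triangular C from d(d+1)/2 free values."""
    C = np.zeros((d, d))
    C[np.tril_indices(d)] = c  # filled row by row
    return C

def std_normal_cdf(w):
    """Phi(w), written with the error function."""
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))

def constrain(c_zz, w, b, tau):
    """Map unconstrained (c_zz, w, b) to (Sigma_zz, kappa, gamma)."""
    d = len(b)  # gamma has one entry per instrument
    C = chol_from_vector(c_zz, d)
    Sigma_zz = C @ C.T
    kappa = std_normal_cdf(w)
    gamma = b * math.sqrt(kappa * tau / (b @ b))
    return Sigma_zz, kappa, gamma
\end{verbatim}
\noindent The same \texttt{chol\_from\_vector} helper can be reused with $d = 2$ for $\Sigma_{\epsilon \epsilon}$.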

\section{Comparison}

For a given synthetic model $\Theta_0$, compute the population lower bound $L_0$ and upper bound $U_0$.

Choose $(m, n)$ to generate $m$ datasets of size $n$. Choose a level $\alpha$ of coverage.

For each of the $m$ datasets $D$, get the posterior left $\alpha$ tail for $\theta$, and the posterior $1 - \alpha$ tail.

Count as ``lower bound success'' if the left tail contains $L_0$, and ``upper bound success'' for $U_0$ analogously.

Return the frequency of success in the $m$ trials as the marginal coverage of each bound. Alternatively we can use the $\alpha / 2$ and $1 - \alpha / 2$ tails and count as success only if we simultaneously cover both bounds.
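
\noindent The counting step can be sketched in Python as below. This is one reading of the procedure, stated as an assumption: a ``lower bound success'' is taken to mean that the posterior $\alpha$ quantile of $\theta$ is at or below $L_0$, and an ``upper bound success'' that the $1 - \alpha$ quantile is at or above $U_0$. If the opposite convention is intended, only the two comparisons change. The function names and the input format (one array of posterior draws of $\theta$ per dataset) are also assumptions.
\begin{verbatim}
import numpy as np

def marginal_coverage(theta_draws, L0, U0, alpha):
    """theta_draws: list of m arrays of posterior draws of theta."""
    lo = np.array([np.quantile(d, alpha) for d in theta_draws])
    hi = np.array([np.quantile(d, 1 - alpha) for d in theta_draws])
    # ASSUMED success rule: quantiles sit outside [L0, U0].
    return float(np.mean(lo <= L0)), float(np.mean(hi >= U0))

def joint_coverage(theta_draws, L0, U0, alpha):
    """Success only if both bounds are covered, using alpha / 2 tails."""
    lo = np.array([np.quantile(d, alpha / 2) for d in theta_draws])
    hi = np.array([np.quantile(d, 1 - alpha / 2) for d in theta_draws])
    return float(np.mean((lo <= L0) & (hi >= U0)))
\end{verbatim}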

Do the equivalent with the same datasets for our method. Compare coverages.

Notice that the above is for only 1 (one) coverage trial. Repeating it for several synthetic models takes time. 

\end{document}