% \newpage
\onecolumn

\title{Calibrated Propensity Scores for Causal Effect Estimation (Supplementary Material)}
\maketitle
% \aistatstitle{Calibrated Propensity Scores for Causal Effect Estimation (Supplementary Material and Code)}
\appendix

%Please find our code at this anonymous link \href{https://anonymous.4open.science/r/CalibratedPropensitiesDemo-0529/README.md}{https://anonymous.4open.science/r/CalibratedPropensitiesDemo-0529/README.md}

\section{COMPARING UNCERTAINTY CALIBRATION WITH OTHER NOTIONS OF CALIBRATION FOR CAUSAL EFFECT ESTIMATION}
\label{apdx:comparison-with-related-work}
Since the true propensity model is often unknown in observational studies, the model we use to learn it is likely to be misspecified. Different parametric and non-parametric models have been proposed to learn propensity scores~\citep{McCaffrey2004-cr, hirano2003efficient, Imbens2004-uy, Lee2010-eu}. Various strategies have also been proposed to improve (and calibrate) the propensity score weights.
\paragraph{Trimming and Overlapping Weights.} When using inverse propensity score weights in causal effect estimators, small deviations in propensity score values can cause large errors in treatment effect estimation. Hence, several strategies have been proposed to trim extreme propensity score weights~\citep{crump2009dealing, li2018addressing, li2018balancing}. Trimming is known to introduce bias in causal effect estimates, as we may lose information on the magnitude of propensity scores from units that correspond to differences in covariate distributions. It is also hard to determine the optimal trimming threshold upfront without sufficient knowledge of the observational study. This problem becomes more pronounced as the complexity of the problem increases (e.g., multiple treatments~\citep{lopez2017estimation}). Calibration, on the other hand, does not throw away the information contained in propensity score weights below an arbitrarily chosen threshold. At the same time, it ensures that we do not produce propensity scores lower than the true propensity score (Theorem~\ref{variance-reduction}).
 Overlapping weights avoid extreme propensity weights by modifying the target population to include units that are more likely to obtain either of the binary treatments~\citep{li2018balancing}, while uncertainty-calibrated propensities do not need to modify the target population. %Thus, calibrated propensities can be used to estimate causal effects for various target populations, e.g., average treatment effect (ATE), average treatment effect of the treated (ATT) and average treatment effect for the overlapping population (ATO). 

\paragraph{High-dimensional Covariates.} Propensity score models also become unstable and show high variance when covariates are high-dimensional.  When performing a direct adjustment of confounding without propensity scores, the estimation problem becomes more complex as the number of covariates increases (e.g., insufficient number of units to estimate outcome reliably for each combination of covariates). In our experiments with genome-wide association studies, we show that simple propensity models can be used for causal effect estimation with high-dimensional covariates through uncertainty calibration. Thus, calibrating propensities can allow us to estimate causal effects with simple (and potentially mis-specified) propensity score models when applying g-computation is infeasible due to high dimensionality.

\paragraph{Covariate Balancing Calibration.} Since the true propensity model is not known, researchers often modify and refit the propensity score model until covariate balance is achieved. Several techniques have been proposed to avoid this cyclic procedure and obtain covariate balance during optimization of the propensity weights~\citep{hainmueller_2012, Imai2014-wi, ning2018robust, zhao2017covariate, zubizarrata2015jose, Chan2016-lf}. Covariate balancing calibration builds on this idea: it solves an optimization problem to find weights that balance any averaged function of the covariates in the treatment and control groups~\citep{benmichael2021balancing}. While these approaches show theoretical and empirical success in improving causal effect estimation, choices such as setting appropriate balance conditions within the optimization problem require substantial knowledge of the observational study. Weights from the true propensity score are a solution to these balancing conditions. However, designing appropriate covariate balancing conditions becomes harder as the dimensionality of the covariates increases. This is even more challenging in the presence of covariate interactions (e.g., certain combinations of covariates representing socio-economic variables make an individual more likely to take up smoking as a treatment variable) and continuous covariates~\citep{benmichael2021balancing}. Uncertainty calibration of a potentially misspecified propensity score model does not change the base model optimization procedure and is simpler to implement on high-dimensional covariates. Thus, it can be more effective when we do not have enough information on the observational study to calibrate (optimize) using appropriate covariate balancing conditions.

\paragraph{Causal Isotonic Calibration.} \citet{Van_Der_Laan2023} propose causal isotonic calibration to improve the estimation of heterogeneous treatment effects (HTEs). Their work enforces a different notion of calibration on the HTE prediction: the average HTE of units with a given predicted HTE is equal to that predicted value. The goal of their work is to ensure more directly that the predicted HTE is reliable for different sub-groups of the population. \citet{yadlowsky2022calibrationerror} propose a technique to compute the calibration error while estimating HTEs following this definition of calibration. Our work, on the other hand, calibrates the uncertainty estimates of the propensity score model, which weights the treated and control outcomes to achieve covariate balance. Our definition of calibration ensures that, among units with an X\% predicted probability of receiving treatment, the fraction of units actually receiving treatment is X\%. Although both calibration methods can be implemented using isotonic regression (with or without cross-validation splits to train the recalibrator), the calibration guarantees are different. Our definition ensures that we avoid extreme propensity weights while balancing covariates and improve the error bounds on causal effect estimates. Applying our method to calibrate propensity scores in HTE estimation could be an interesting way to reduce the issue of extreme propensity weights while performing causal isotonic calibration~\citep{Van_Der_Laan2023} (e.g., in the case of high-dimensional or complex covariates). Although we only present results with the ATE metric in our main paper, our method can be applied to HTE estimation. Table~\ref{table:toy-expr-pehe} in Appendix~\ref{apdx:additional-experiments} demonstrates that propensity score calibration also improves HTE estimation consistently in the drug effectiveness experiment from Table~\ref{table:toy-expr} in the main paper.
It is also possible to apply our method independently to avoid extreme propensity scores when estimating the calibration error as proposed by~\citet{yadlowsky2022calibrationerror}.
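
For concreteness, the two notions of calibration contrasted in this paragraph can be written side by side. The following is only a sketch: $Q(T=1 \mid X)$ denotes the predicted probability of treatment from the propensity score model, and $\hat{\tau}(X)$ denotes a generic HTE predictor (a symbol introduced here for illustration, not taken from the cited works).
\begin{align*}
\text{Propensity (uncertainty) calibration:} \quad & \mathbb{P}\big(T=1 \mid Q(T=1 \mid X)=p\big) = p \quad \text{for all } p \in [0,1], \\
\text{Causal (HTE) calibration:} \quad & \mathbb{E}\big[Y(1)-Y(0) \mid \hat{\tau}(X)=c\big] = c \quad \text{for all predicted values } c.
\end{align*}
The first condition constrains the treatment assignment model, whereas the second constrains the effect predictor, which is why the two guarantees differ.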

% While causal isotonic regression~\citep{Van_Der_Laan2023} performs calibration of HTEs more directly, it requires the propensity score model or outcome model to be sufficiently accurate for their approach to work. Ensuring this is difficult in practice, especially when we do not know the true treatment assignment mechanism in observational studies. Our uncertainty calibration method can be applied to mis-specified propensity models (possibly producing extreme weights) as demonstrated in several experiments. 


% They do not propose techniques to enforce their notion of calibration, but demonstrate the effectiveness of their method to compute calibration error as a metric while estimating HTEs. Our work proposes methods to enforce propensity score calibration and demonstrates consistent improvement in the accuracy of ATE estimates. Table 6 in our Appendix demonstrates that propensity score calibration also improves HTE estimation consistently in the drug effectiveness experiment

%Additionally,  However, it is hard to ensure this in practice, especially when the true treatment assignment mechanism is unknown in observational studies. Our method can be applied to mis-specified propensity score models that produce extreme weights. Hence, 

\paragraph{Other Ideas.} Other notions of propensity score calibration have been discussed in the literature spanning survey sampling, missing data problems and causal inference~\citep{Lee2009Estimation, Sturmer2007-ko}. However, these methods rely on a different setup (for example, access to a validation dataset with information on extra variables) to perform calibration. Our method performs calibration under the assumption of no hidden confounding and does not require access to extra datasets (our calibration dataset can be generated with cross-validation).

\section{ESTIMATORS FOR AVERAGE TREATMENT EFFECTS}

\label{apdx:estimators}
We expressed the ATE as $\tau = \mathbb{E} \bigg(\frac{TY}{e(X)} - \frac{(1-T)Y}{1-e(X)}\bigg)$. Following~\citet{smith2020tutorial}, we can simplify the first term as follows:
\begin{align*}
\mathbb{E} \bigg[\frac{TY}{e(X)}\bigg] &= \mathbb{E} \bigg[\mathbb{E}\bigg(\frac{TY}{e(X)} \,\bigg|\, T, X\bigg)\bigg] \\
&= \mathbb{E} \bigg[\frac{T\, \mathbb{E}(Y \mid T, X)}{e(X)}\bigg] \\
&= \mathbb{E} \bigg[\frac{T\, \mathbb{E}(Y \mid T=1, X)}{e(X)}\bigg] \\
&= \mathbb{E} \bigg[\mathbb{E}\bigg(\frac{T\, \mathbb{E}(Y \mid T=1, X)}{e(X)} \,\bigg|\, X \bigg)\bigg] \\
&= \mathbb{E} \bigg[\frac{\mathbb{E}(Y \mid T=1, X)\, P(T=1 \mid X)}{e(X)} \bigg] \\
&= \mathbb{E} [\mathbb{E} (Y \mid T=1, X)].
\end{align*}
Similarly, 
\begin{align*}
\mathbb{E} \bigg[\frac{(1-T)Y}{1-e(X)}\bigg] &= \mathbb{E} [\mathbb{E} (Y|T=0, X)]. 
\end{align*}

Thus, we can show that ATE is indeed equivalent to $ \mathbb{E} \bigg(\frac{TY}{e(X)} - \frac{(1-T)Y}{1-e(X)}\bigg)$. 
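
The final step of this argument relies on the standard identification assumptions. As a sketch, assuming consistency ($Y = Y(T)$), ignorability ($Y(t) \perp T \mid X$) and overlap ($0 < e(X) < 1$), with $Y(t)$ denoting the potential outcome under treatment $t$, we have
\begin{align*}
\mathbb{E} [\mathbb{E} (Y \mid T=1, X)] &= \mathbb{E} [\mathbb{E} (Y(1) \mid T=1, X)] = \mathbb{E} [\mathbb{E} (Y(1) \mid X)] = \mathbb{E} [Y(1)], \\
\mathbb{E} [\mathbb{E} (Y \mid T=0, X)] &= \mathbb{E} [Y(0)],
\end{align*}
so that the difference of the two weighted terms equals $\mathbb{E}[Y(1) - Y(0)]$.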

Because the IPTW estimator is sensitive to misspecification of the propensity score model, \citet{Robins1994estimation} propose the doubly robust Augmented Inverse Propensity Weighted (AIPW) estimator for the ATE. The AIPW estimate is asymptotically unbiased when either the treatment assignment (propensity) model or the outcome model is well-specified. %, but this assumption is rarely satisfied in real world. 

We define the outcome model as $f(X=x, T=t)$ to approximate the outcome $Y[X=x, T=t]$ as defined in Section~\ref{sec:background}.

With this, we define the AIPW estimator as 
\begin{align*}
    \hat{\tau} &= \frac{1}{n}\sum_{i=1}^n \Bigg[f(X_i, T=1) - f(X_i, T=0) + \frac{T_i (Y_i-f(X_i, T=1))}{e(X_i)} - \frac{(1-T_i)(Y_i-f(X_i, T=0))}{1-e(X_i)}\Bigg] 
\end{align*}
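The double robustness of this estimator can be seen from a short calculation on the treated arm; the control arm is analogous. The following is a sketch that treats $f$ and $e$ as fixed (possibly misspecified) functions, where $\mu_1(X) = \mathbb{E}(Y \mid T=1, X)$ and $\pi(X) = P(T=1 \mid X)$ are shorthand introduced here for the true outcome regression and the true propensity:
\begin{align*}
\mathbb{E}\Bigg[f(X, T=1) + \frac{T\,(Y - f(X, T=1))}{e(X)}\Bigg]
&= \mathbb{E}\Bigg[f(X, T=1) + \frac{\pi(X)}{e(X)}\big(\mu_1(X) - f(X, T=1)\big)\Bigg].
\end{align*}
If $e = \pi$, the right-hand side reduces to $\mathbb{E}[\mu_1(X)]$ for any $f$; if $f(\cdot, T=1) = \mu_1$, the correction term vanishes for any $e$ bounded away from zero. In either case the treated-arm term equals $\mathbb{E}[\mu_1(X)]$.
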
\section{ADDITIONAL DETAILS ON THE CALIBRATION ALGORITHM}
\label{apdx:additional-details-on-cal-algorithm}
The running time of Algorithm~\ref{alg:cal_prop_scoring} depends linearly on the number of data splits created (training set and calibration set), in addition to the time complexity of training the propensity model $Q(T|X)$ and the recalibrator (Algorithm~\ref{alg:recalibrate}). It also includes an additive term for computing $R \circ Q$ on all data points in the dataset $\mathcal{D}$. The space complexity depends linearly on the size of $\mathcal{D}$, together with additive terms for the model sizes of $Q(T|X)$ and $R$.
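
As a rough summary of this statement, and under the assumption that the recalibrator is trained once on the pooled calibration data, the running time can be sketched as follows. The symbols $k$ (number of data splits), $n = |\mathcal{D}|$, $C_Q$ (cost of training $Q$ once), $C_R$ (cost of training $R$) and $c_{R \circ Q}$ (cost of evaluating $R \circ Q$ on one data point) are introduced here only for illustration.
\begin{align*}
\text{Time} = O\big(k \cdot C_Q + C_R + n \cdot c_{R \circ Q}\big).
\end{align*}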

\paragraph{Designing the Recalibration Method.} When the treatments are binary, we can choose between isotonic regression and logistic regression as the recalibrator. Since isotonic regression is prone to overfitting, we prefer logistic regression when the calibration dataset is small (e.g., fewer than 1000 data points). %The choice of proper loss function is dependent on the choice of recalibrator. For example, log loss is an appropriate proper loss function for learning the logistic regression recalibrator model.
Leave-one-out cross-validation splits can be useful for generating the calibration dataset when the dataset is small. When moving to the multiple-treatment or continuous-treatment setting, designing the recalibrator may involve more choices (for example, we can use a simple neural network as the recalibrator for continuous treatments). Using a separate cross-validation dataset would help select these hyperparameters.

\paragraph{Cross-validation Splits. }
The requirement to allocate a separate calibration dataset may reduce the amount of data available for training the propensity score model $Q(T|X)$. Hence, we can use cross-validation splits of the dataset to calibrate a propensity score model. To implement this approach, we divide our dataset $\mathcal{D}$ into $k$ partitions $\{S_1, S_2, \ldots, S_k\}$. For each split $S_j$, we train the propensity score model $Q_j(T|X)$ on $S_j$ and generate part of the recalibrator training dataset (as defined in Algorithm 2) as $C_j = \{(Q_j(x), y) \mid (x, y) \in \mathcal{D} \setminus S_j\}$. We then take the union over all $C_j$ to generate the complete recalibrator training dataset. This allows us to use the entire available dataset for training both the propensity score model and the recalibrator, which is especially useful when the available dataset is small. In our experiments, we have used leave-one-out cross-validation splits (thus, each partition $S_j$ is of size $n-1$, where $n$ is the size of the dataset $\mathcal{D}$).
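
As an illustration of the leave-one-out case, consider the following sketch, which indexes the splits by $j = 1, \ldots, n$ and writes $(x_j, y_j)$ for the single held-out data point (this indexing is introduced here for illustration):
\begin{align*}
S_j &= \mathcal{D} \setminus \{(x_j, y_j)\}, &
C_j &= \{(Q_j(x_j), y_j)\}, &
C &= \bigcup_{j=1}^{n} C_j = \{(Q_j(x_j), y_j)\}_{j=1}^{n}.
\end{align*}
Each model $Q_j$ is therefore evaluated only on the point it did not see during training, and the recalibrator is fit on $n$ such out-of-sample predictions.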



\section{DRUG EFFECTIVENESS SIMULATIONS}
The covariates are gender ($x_1$), age ($x_2$) and disease severity ($x_3$), while the treatment ($t$) corresponds to administration of the drug. The outcome ($y$) is the time taken for the patient to recover.

We simulate the covariates as
\begin{align*}
    x_1 \sim \text{Bernoulli}(0.5) && x_2 \sim \text{Gamma}(\alpha=8, \beta=4) && x_3 \sim \text{Beta}(\alpha=3, \beta=1.5).
\end{align*}
The outcome is simulated as 
\[ y \sim \text{Poisson}(2+0.5 x_1+0.03 x_2+2 x_3-t). \]
The treatment $t$ is assigned on the basis of the covariates age, gender and severity of disease defined above. The simulations differ in their treatment assignment functions, which are described as follows
\begin{enumerate}
    \item Simulation A: If $(x_1=1)$, set $ t=(x_2 > 45)$ else set $t=(x_3 > 0.3).$
    \item Simulation B: If $(x_1=1)$, set $ t=(x_3 > 0.3)$ else set $t=(x_2 > 40).$
    \item Simulation C: If $x_2 > 50 \text{ AND } x_3>0.7$ then set $t=1$ else $t=0$.
    \item Simulation D: If $x_2 > 50 \text{ XOR } x_3>0.7$ then set $t=1$ else $t=0$.
\end{enumerate}
For a linear model predicting treatment given covariates, Simulation C is easier to learn than Simulations A, B and D.
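
Equivalently, the four assignment rules can be written with indicator functions, which makes the source of the non-linearity explicit. This restatement is only a sketch; the shorthand $a = \mathbf{1}[x_2 > 50]$ and $b = \mathbf{1}[x_3 > 0.7]$ is introduced here for illustration.
\begin{align*}
\text{A:}\quad t &= x_1\,\mathbf{1}[x_2 > 45] + (1-x_1)\,\mathbf{1}[x_3 > 0.3], &
\text{B:}\quad t &= x_1\,\mathbf{1}[x_3 > 0.3] + (1-x_1)\,\mathbf{1}[x_2 > 40], \\
\text{C:}\quad t &= a\,b, &
\text{D:}\quad t &= a + b - 2\,a\,b.
\end{align*}
Simulations A and B gate one threshold rule by the binary covariate $x_1$, and Simulation D assigns treatment on two opposite quadrants of the $(x_2, x_3)$ plane, which no single linear decision boundary can separate.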

Table~\ref{apdx:table:comp-basemodels} works with a slightly modified simulation D, where the treatment is set to 1 with probability of 0.99 when the XOR condition is true (otherwise 0), while it is set to 0 with probability 0.99 when the condition is false. 

\paragraph{Experimental Setup.} We model the outcome using random forests that take the covariates and treatment as input. Logistic regression is used as the propensity score model, and the inverse propensity scores are used to weight each sample while training the outcome model. We use isotonic regression as the recalibrator. % and we use 10 cross-validation splits to generate the calibration dataset.
The treatment effect is expressed as the ratio $\mathbb{E}(Y(1))/\mathbb{E}(Y(0))$, where $Y(T)$ is the potential outcome $Y$ obtained by setting the treatment to $T$. The outcome is the time taken by the patient to make a full recovery from the disease. We use 10 cross-validation splits to generate the recalibration dataset.

The trimming baseline clips propensity weights at a threshold of 0.001. Thresholds of 0.001--0.01 are commonly applied when using causal effect estimators based on inverse propensities.
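
To make the scale of these thresholds concrete, the following sketch assumes that trimming clips the estimated propensity $e(X)$ to the interval $[c, 1-c]$. This symmetric form and the symbols $c$ and $\tilde{e}$ are assumptions made here for illustration, since the exact clipping rule is not spelled out in this appendix.
\begin{align*}
\tilde{e}(X) = \min\{\max\{e(X),\, c\},\, 1-c\}, \qquad \max\bigg\{\frac{1}{\tilde{e}(X)},\, \frac{1}{1-\tilde{e}(X)}\bigg\} \le \frac{1}{c}.
\end{align*}
With $c = 0.001$ a single unit can receive a weight of up to $1000$, and with $c = 0.01$ up to $100$.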

The experiments were run on a laptop with a 2.8\,GHz quad-core Intel i7 processor.

In Figure~\ref{fig:apdx:calib_curve_simA}, we see that the calibration curve of the propensity score model gets closer to the diagonal after recalibration.
\begin{figure}[H]
\centering
\vspace{1 cm}
\includegraphics[scale=0.30]{images/calib_curve_simA.png}
\includegraphics[scale=0.30]{images/calib_curve_simB.png}
\includegraphics[scale=0.30]{images/calib_curve_simC.png}
\includegraphics[scale=0.30]{images/calib_curve_simD.png}

\caption{Calibration curves of the propensity score model in the Drug Effectiveness Study.} 
% \vspace{-1cm}
\label{fig:apdx:calib_curve_simA}
\end{figure}
\label{apdx:drug-effectiveness}
\section{UNSTRUCTURED COVARIATES EXPERIMENT}

\label{apdx:unstructured-covars}
Following~\citet{louizos2017causal}, we generate a synthetic observational dataset consisting of binary variables $X, T, Y \sim \mathbb{P}$, such that
\begin{align*}
    \mathbb{P}(Z =1) = \mathbb{P}(Z=0) = 0.5 && 
    \mathbb{P}(X=1|Z=1) = 0.3 &&  \mathbb{P}(X=1|Z=0) = 0.1  \\
    \mathbb{P}(T=1|Z=1) = 0.4 &&   \mathbb{P}(T=1|Z=0) = 0.2 &&
    Y = T \oplus Z.
\end{align*}
\citet{louizos2017causal} show that the true ATE under this simulation is zero. The presented results include propensity weight trimming with a threshold of 0.01. %We would like to note that the presence of hidden confounder $Z$ implies that ignorability is not satisfied in this experiment. 
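
The zero ATE can be checked directly from the generative process. As a short verification, writing $Y(t) = t \oplus Z$ for the potential outcomes (a notation assumed here from the structural equation $Y = T \oplus Z$):
\begin{align*}
\mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] &= \mathbb{P}(Z=0) - \mathbb{P}(Z=1) = 0.5 - 0.5 = 0, \\
\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0] &= \mathbb{P}(Z=0 \mid T=1) - \mathbb{P}(Z=1 \mid T=0) = \frac{1}{3} - \frac{3}{7} = -\frac{2}{21}.
\end{align*}
The second line uses Bayes' rule, $\mathbb{P}(Z=1 \mid T=1) = 0.2/0.3$ and $\mathbb{P}(Z=1 \mid T=0) = 0.3/0.7$, and shows that the unadjusted difference in means is biased (approximately $-0.095$) even though the true effect is zero.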

The simulation generation and the ATE estimation experiments were run on a laptop with a 2.8\,GHz quad-core Intel i7 processor.
\section{SIMULATED GWAS DATASETS}

\label{apdx:sim-gwas}
We have $N$ individuals and a total of $M$ SNPs for each individual. Thus, we need to simulate a SNP matrix $G \in \{0, 1\}^{N \times M}$ and an outcome vector $Y \in \mathbb{R}^N$. We also have a matrix of confounding variables $Z \in \mathbb{R}^{N \times D}$ for these $N$ individuals. We do not observe the confounding variables. Following \citet{wang2019blessings}, we generate the following genotype simulations.

To generate the SNP matrix, we generate an allele frequency matrix $F \in \mathbb{R}^{N \times M}$ such that $F = S\Gamma^\top, $ where $S \in \mathbb{R}^{N \times D}$ encodes genetic population structure and $\Gamma \in \mathbb{R}^{M \times D}$ maps how structure affects alleles. 

Thus, $g_{ij} \sim \text{Binomial}(1, F_{ij})$. 

The outcome is modeled as 
$ Y = \beta^T G + \alpha^T Z + \epsilon,$
where $\beta$ is the vector of treatment effects for each SNP, $\alpha$ is the vector of coefficients corresponding to the hidden confounders in $Z$ and $\epsilon$ is noise distributed independently as a Gaussian. 

We simulate a high signal-to-noise ratio when simulating outcomes by rescaling $\lambda_i = (\alpha^T Z)_i$ and $\epsilon_i$ as
\begin{align*}
    \lambda_i &\leftarrow \Bigg[\frac{\mathrm{s.d.}\{\sum_{j=1}^{M}\beta_jg_{ij}\}_{i=1}^N}{\sqrt{\nu_{gene}}}\Bigg]\Bigg[\frac{\sqrt{\nu_{conf}}}{\mathrm{s.d.}\{\lambda_i\}_{i=1}^N}\Bigg]\lambda_i, \\
    \epsilon_i &\leftarrow \Bigg[\frac{\mathrm{s.d.}\{\sum_{j=1}^{M}\beta_jg_{ij}\}_{i=1}^N}{\sqrt{\nu_{gene}}}\Bigg]\Bigg[\frac{\sqrt{\nu_{noise}}}{\mathrm{s.d.}\{\epsilon_i\}_{i=1}^N}\Bigg]\epsilon_i,
\end{align*}
where $\nu_{gene} = 0.4, \nu_{conf} = 0.4,$ and $\nu_{noise} = 0.2$.
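
The effect of this rescaling can be checked with a short calculation. As a sketch, write $\sigma_g = \mathrm{s.d.}\{\sum_{j=1}^{M}\beta_jg_{ij}\}_{i=1}^N$ (a shorthand introduced here for illustration); after the replacement,
\begin{align*}
\mathrm{s.d.}\{\lambda_i\}_{i=1}^N = \sigma_g \sqrt{\frac{\nu_{conf}}{\nu_{gene}}} = \sigma_g, \qquad
\mathrm{s.d.}\{\epsilon_i\}_{i=1}^N = \sigma_g \sqrt{\frac{\nu_{noise}}{\nu_{gene}}} = \frac{\sigma_g}{\sqrt{2}},
\end{align*}
so the genetic, confounding and noise components have variances in the ratio $0.4 : 0.4 : 0.2$. If the three components are approximately uncorrelated, these are also their shares of the total outcome variance.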

Below, we reproduce the simulation details as described by \citet{wang2019blessings}. $\Gamma$ and $S$ are simulated in different ways to generate the following datasets. 

\begin{enumerate}
    \item \textbf{Spatial Dataset}: The matrix $\Gamma$ was generated by sampling $\gamma_{ik} \sim 0.9 \times \text{Uniform}(0,0.5)$
for $k = 1,2$ and setting $\gamma_{i3} = 0.05$. The first two rows of $S$ correspond to coordinates for each individual on the unit square and were set to be independent and identically distributed samples from $\text{Beta}(\alpha, \alpha)$ with $\alpha \in \{0.1, 0.3, 0.5\}$, while the third row of $S$ was set to be 1, i.e., an intercept. As $\alpha \to 0$, the individuals are placed closer to the corners of the unit square, while when $\alpha = 1$, the individuals are distributed uniformly.
    \item \textbf{Balding--Nichols Model (BN)}: Each row $i$ of $\Gamma$ has three independent and identically distributed draws taken from the Balding--Nichols model: $\gamma_{ik} \sim \text{BN}(p_i, F_i)$, where $k \in \{1,2,3\}$. The pairs $(p_i,F_i)$ are computed by randomly selecting a SNP in the HapMap dataset, calculating its observed allele frequency and estimating its $F_{ST}$ value using the Weir \& Cockerham estimator~\citep{weir1984estimating}. The columns of $S$ were Multinomial(60/210, 60/210, 90/210), which reflect the subpopulation proportions in the HapMap dataset.
    \item \textbf{1000 Genomes Project (TGP)}~\citep{1000_Genomes_Project_Consortium2015-wg}: The matrix $\Gamma$ was generated by sampling $\gamma_{ik} \sim 0.9 \times \text{Uniform}(0,0.5)$
for $k = 1,2$ and setting $\gamma_{i3} = 0.05$. In order to generate $S$, we compute the first two principal components of the TGP genotype matrix after mean centering each SNP. We then transform each principal component to lie in $(0,1)$ and set the first two rows of $S$ to be the transformed principal components. The third row of $S$ was set to 1, i.e., an intercept.
    \item \textbf{Human Genome Diversity Project (HGDP)}~\citep{hgdp, hgdp2020}: Same as TGP, but generating $S$ with the HGDP genotype matrix.
\end{enumerate}

These simulations and the ATE estimation experiments were all run on a laptop with a 2.8\,GHz quad-core Intel i7 processor. The presented results include propensity weight trimming with a threshold of 0.01 (after applying a possible calibration step).

\section{ADDITIONAL EXPERIMENTAL RESULTS}

\label{apdx:additional-experiments}

For the Drug Effectiveness simulations, Table~\ref{apdx:table:comp-basemodels} compares a range of base propensity score models when the true treatment assignment function is a non-linear logical XOR (Appendix~\ref{apdx:drug-effectiveness}). We see the benefits of calibration across varying degrees of misspecification in the base model. After calibration, the non-linear MLP and SVM (RBF) models show the best $\varepsilon_{ATE}$, while misspecified linear models such as logistic regression also show a consistent reduction in $\varepsilon_{ATE}$. We observe a greater reduction in bias ($\varepsilon_{ATE}$) when the reduction in ECE is larger.


Table~\ref{table:toy-expr-pehe} extends Table~\ref{table:toy-expr} with the PEHE metric on all the simulation settings. 

For the GWAS experiments, we provide a complete table of dataset simulations and a comparison against different base propensity models in Table~\ref{table:apdx:gwas-basic} and Table~\ref{table:apdx: gwas-comp}, respectively.  %Table~\ref{apdx:table:calib_nv_vs_lr} also shows the comparison of calibrated naive bayes with logistic regression with both IPTW and AIPW estimators. 


\begin{table*}[ht]
\caption{Recalibrating the Output of the Propensity Score Model Results in Lower Error in Estimating Causal Effects. A reduction in ECE ($\Delta (ECE)$) implies that the calibration of the model improves with our technique. PEHE results are averaged over 10 experimental repetitions, and parentheses contain the standard error.}
%The baselines consist of weighing with plain propensities (Plain), trimmed propensities (Trim), stabilized weights (SW) and covariate balancing (Cov. Bal.).
% \vspace{-0.6cm}
% \caption{Recalibrating Propensities. True ATE in all the simulations below is 0.368.}
%Reduction in ECE ($\Delta (ECE)$) implies that the calibration of the model improves with our technique. Results consisting of  $\varepsilon_{ATE}$ are averaged over 10 experimental repetitions and braces contain the standard error.
\small
\centering


\begin{tabular}{llcccc}
\toprule
% 
%&{Setting} & \multicolumn{4}{c}{$\varepsilon_{ATE}$} \\%& \multicolumn{4}{c}{PEHE} \\
 & Setting & Sim A & Sim B & Sim C & Sim D \\%& Sim A & Sim B & Sim C & Sim D \\
\midrule
 & Naive & 0.263 (0.002) & 0.075 (0.002) & 0.105 (0.002) & 0.103 (0.003)\\
% & T learner & 0.098 (0.005) & 0.200 (0.001) & 0.153 (0.002) &  0.058 (0.006) & 0.017 (0.001) & 0.062 (0.001) & 0.047 (0.001) & 0.035 (0.001)\\
\midrule
 & Plain propensities & 0.149 (0.024) & 0.068 (0.001) & 0.052 (0.001) & \textbf{0.031 (0.001)}\\
 & Trimmed ~\citep{Lee2011-nv} & 0.245 (0.004) & 0.067 (0.001) & 0.046 (0.001) & \textbf{0.031 (0.001)}\\
 & Stabilized Wt~\citep{Xu2010-jv} & 0.195 (0.013) & 0.076 (0.002) & 0.114 (0.004) & 0.112 (0.005)\\
\midrule
 & Covariate Balancing~\citep{tan2020regularized} &\multirow{1}{*}{0.280 (0.003)} & \multirow{1}{*}{\textbf{0.056 (0.001)}} & \multirow{1}{*}{0.050 (0.003)} & \multirow{1}{*}{0.107 (0.007)}\\
% & \citet{tan2020regularized} & & & & & & & & \\
\midrule
 & Calibrated (Ours) & 0.047 (0.010) & \textbf{0.057 (0.001)} & \textbf{0.042 (0.001)} & \textbf{0.032 (0.001)}\\
  & Calibrated + Trimmed & 0.049 (0.010) & \textbf{0.057 (0.001)} & \textbf{0.042 (0.001)} &  \textbf{0.032 (0.001)} \\
 & Calibrated + Stabilized Wt & \textbf{0.030 (0.007)} & \textbf{0.057 (0.001)} & \textbf{0.042 (0.001)} &  0.033 (0.001)\\
 \midrule
 & $\Delta(ECE)$ & 0.010 (0.001) & 0.014 (0.001) & 0.025 (0.002) & 0.019 (0.001) \\
\bottomrule
\end{tabular}
\vspace{-2mm}
\label{table:toy-expr-pehe}
\end{table*}
\begin{table}
% \begin{wraptable}{r}{5.5cm}
% \vspace{-0.8cm}
\caption{Comparison of different base propensity score models. (Sim D)}
% \hspace{0.1cm}
\vspace{0.1cm}
\centering
\small
\begin{tabular}{lcccc}
\toprule
Base model & $\varepsilon_{ATE}$ (Plain) & ECE (Plain) & $\varepsilon_{ATE}$ (Calib) & ECE (Calib)\\
\midrule
Log. Reg. & 0.031 (0.003) & 0.124 (0.001) & 0.016 (0.002) & 0.018 (0.001) \\

 MLP & 0.014 (0.005)& 0.075 (0.002) & 0.008 (0.003) & 0.012 (0.002) \\

 SVM (Linear)  & 0.032  {(0.005)} & 0.126 (0.001)
 & 0.015 (0.003) & 0.017 (0.001)
 \\
 SVM (RBF) & 0.012 (0.003) & 0.020 (0.000)  & 0.009 (0.004) & 0.011 (0.001)
\\
% Random Forests & 0.048
% (0.007) & 0.033
% (0.007) \\
Adaboost & 0.039 (0.003) & 0.296 (0.001) 
 & 0.022 (0.004) & 0.037 (0.008) \\
Naive Bayes & 0.022 (0.004) & 0.146 (0.001) & 0.017 (0.003) & 0.016 (0.002) \\
%  Decision Tree & 0.506 (0.003) & 0.000
% (0.000)
%  & 0.504
% (0.003) & 0.000
% (0.000)\\

 
\bottomrule

\end{tabular}
\vspace{-0.2cm}
\label{apdx:table:comp-basemodels}
% \end{wraptable}
\end{table}

% \begin{table}[ht]
% % \vspace{-0.8cm}
% \caption{Calibration reduces the bias in treatment effect estimation across different base models (Simulation A). }
% \hspace{0.1cm}

% \centering
% \begin{tabular}{lccccr}
% \toprule
% Base classifier & \multicolumn{2}{c}{Plain Propensities} & \multicolumn{2}{c}{Recalibrated Propensities} \\
%  & $\varepsilon_{TE}$ & ECE & $\varepsilon_{TE}$ & $ECE$ \\
% \midrule
%  Logistic Regression  & 0.479  (0.005) & 0.029
% (0.001)
%  & 0.091 (0.022) & 0.017
% (0.001) 
%  \\
%  MLP & 0.455 (0.042) & 0.038
% (0.001) & 0.027 (0.031) & 0.014
% (0.001)
% \\
% %  Decision Tree & 0.506 (0.003) & 0.000
% % (0.000)
% %  & 0.504
% % (0.003) & 0.000
% % (0.000)\\
% % Adaboost & 0.506 (0.003) & 0.000
% % (0.000)
% %  & 0.504
% % (0.003) & 0.000
% % (0.000)\\
%  SVM & 0.485
% (0.004) & 0.041 (0.001) & 0.454
% (0.013)
%  & 0.018 (0.000)\\
%  Naive Bayes & 0.471
% (0.003) & 0.064 (0.000) & 0.021
% (0.018)
%  & 0.003 (0.000)\\
% \bottomrule

% \end{tabular}

% \label{apdx:table:comp-basemodels}
% \end{table}
% \begin{table*}[ht]

% \caption{Reduction in ATE Estimation Error $\varepsilon_{ATE}$ with Structured and Unstructured Covariates.}
% \vspace{0.2cm}
% % \hspace{0.1cm}
% \small
% \centering
% \begin{tabular}{lccccr}
% \toprule
% % % 
% % Setting &  Naive  & \multicolumn{2}{c}{Plain Propensities} &  & \multicolumn{2}{c}{Uncertainty Recalibration} & $\Delta$(ECE) \\

% %  & estimation &  &  & &   &  &  \\
% % \midrule
% %  Image Covariate & 0.187 (0.010) & & 0.107 (0.029) &  &  & 0.095 (0.005) & 0.137 (0.046) \\
% %  Binary Covariate & 0.176 (0.019) & & 0.052 (0.011) &  & & 0.099 (0.008) & 0.112 (0.029)\\

% % 
% Setting &  Naive Est.  & {Plain Propensities} &  {Uncertainty Recalibration} & ECE (before/after calibration) \\
% \midrule
%  Image Covariate & 0.187 (0.010) & 0.107 (0.029) & 0.095 (0.005) & 0.137 (0.046) \\
%  Binary Covariate & 0.176 (0.019) & 0.091 (0.011)  & 0.085 (0.008) & 0.112 (0.029)\\
% \bottomrule
% \vspace{-0.2cm}
% \end{tabular}

% \label{apdx:table:mnist-expr}
% \end{table*}
\begin{table}[H]
% \vspace{-0.8cm}
\caption{GWAS with calibrated propensities. We compare IPTW and AIPW estimates using calibrated propensity scores against several standard GWAS baselines. $\varepsilon_{ATE}$ is the $\ell_2$ norm of the difference between the true and estimated marginal treatment effect vectors. Under all setups, calibrated propensities improve the treatment effect estimates relative to plain propensities.}
\hspace{0.1cm}

\centering
\small
\begin{tabular}{lcccccc}
\toprule
Dataset	& Spatial & 	Spatial & 	Spatial & 	Balding & 	HGDP	& TGP \\
& ($\alpha$=0.1)& 	 ($\alpha$=0.3)&  ($\alpha$=0.5)&  Nichols& 	& \\
\midrule
Naive	& 16.23 (0.91)	& 11.76 (0.84)	& 9.81 (0.69)& 	19.25 (1.17)	& 11.82 (0.11)	&  12.24 (0.71) \\
PCA	& 9.60 (0.37)	& 9.54 (0.41)	& 9.38 (0.38) & 	14.12 (1.28) &  	11.69 (0.20) & 	10.73 (0.38) \\
FA	& 9.55 (0.34) & 9.53 (0.44) & 	9.23 (0.30) & \textbf{12.59 (1.05)}	& 11.65 (0.16)	& 10.59 (0.32) \\
LMM	 & 10.24 (0.41) & 9.58 (0.45) & \textbf{8.15 (0.40)} & \textbf{13.13 (1.09)} & \textbf{10.09 (0.35)} & \textbf{9.44 (0.57)} \\
IPTW (Calib) 	& \textbf{8.13 (0.35)} & 	\textbf{8.69 (0.56)} & 	\textbf{8.32 (0.34)}	 & \textbf{13.62 (0.68)} & 	10.86 (0.13) & 	\textbf{9.57 (0.58)} \\
IPTW (Plain) & 12.56 (1.25) & 10.22 (0.81) & 9.09 (0.48) & 14.36 (0.74) & 11.62 (0.12)	& 11.76 (0.86) \\
AIPW (Calib)	& 8.94 (0.29)	& 9.00 (0.58)& 	8.59 (0.39) & 	16.81 (1.39) & 	11.06 (0.12) & 	10.32 (0.43) \\
AIPW (Plain)	& 13.89 (0.76) & 	10.46 (0.72) & 	8.99 (0.51)	& 17.66 (1.33)	& 11.38 (0.11)	& 11.56 (0.65) \\
$\Delta_{ECE}$ & 0.022 (0.001) & 0.016 (0.007) & 0.015 (0.001)& 0.013 (0.002)& 0.011 (0.001)& 0.022 (0.001) \\
\bottomrule
\end{tabular}

\label{table:apdx:gwas-basic}
\end{table}

\begin{table}[ht]
% \vspace{-0.8cm}
\caption{Comparing propensity score models. We compare the AIPW estimate using plain and calibrated propensities and observe a reduction in error after calibration across a range of base propensity score models.}
\hspace{0.1cm}

\centering
\small
\begin{tabular}{lcccccc}
\toprule
Dataset  &  Metrics  & LR  & MLP  &  Random Forest  & Adaboost  &  NB \\
\midrule
Spatial  & $\varepsilon_{ATE}$ (plain)
 & 13.886 (0.755)
 & 17.403 (1.070)
 & 12.911 (0.612)
 & 16.234 (0.916)
 & 582.731 (64.514) \\
 ($\alpha$=0.1) &  $\varepsilon_{ATE}$ (calib)
 & 8.942 (0.287)
 & 14.661  (0.762)
 & 8.706 (0.322)
 & 8.524 (0.297)
 & 8.526 (0.472) \\
 & $\Delta_{ECE}$ 
 & 0.022 (0.001)
 & 0.072 (0.003)
 & 0.060 (0.001)
 & 0.252 (0.006)
 & 0.281 (0.002) \\
 \midrule
 Spatial 
 &  $\varepsilon_{ATE}$ (plain)
 & 10.460 (0.720)
 & 12.636 (0.730)
 & 10.578 (0.768)
 & 11.764 (0.839)
 & 400.643 (49.301) \\
 ($\alpha$=0.3)  & $\varepsilon_{ATE}$ (calib)
 & 9.000 (0.580)
 & 11.550 (0.747)
 & 9.277 (0.532)
 & 8.909 (0.549)
 & 9.121 (0.535) \\
 & $\Delta_{ECE}$
 & 0.016 (0.007)
 & 0.070 (0.003)
 & 0.063 (0.001)
 & 0.244 (0.006)
 & 0.281 (0.002) \\
 \midrule
 Spatial 
 &  $\varepsilon_{ATE}$ (plain)
 & 8.990 (0.510)
 & 10.408 (0.694)
 & 9.277 (0.518)
 & 9.814 (0.691)
 & 276.017 (24.183) \\
 ($\alpha$=0.5) &  $\varepsilon_{ATE}$ (calib)
 & 8.590 (0.390)
 & 9.728 (0.650) 
 & 8.687 (0.224)
 & 8.520 (0.286)
 & 8.592 (0.216) \\
 & $\Delta_{ECE}$
 & 0.015 (0.001)
 & 0.070 (0.002)
 & 0.065 (0.001)
 & 0.239 (0.007)
 & 0.269 (0.003) \\
 \midrule
 Balding 
 & $\varepsilon_{ATE}$ (plain)
 & 17.660 (1.330)
 & 18.282 (1.267)
 & 18.419 (1.210)
 & 19.248 (1.169)
 & 95.892 (6.350) \\
 Nichols & $\varepsilon_{ATE}$ (calib)
 & 16.810 (1.390)
 & 17.033 (1.391)
 & 16.611 (1.385)
 & 16.938 (1.367)
 & 16.833 (1.392) \\
 & $\Delta_{ECE}$
 & 0.013 (0.002)
 & 0.041 (0.002)
 & 0.052 (0.002)
 & 0.259 (0.010)
 & 0.261 (0.009) \\
 \midrule
HGDP
 &  $\varepsilon_{ATE}$ (plain)
 & 11.380 (0.110)
 & 12.358 (0.197)
 & 11.529 (0.107)
 & 11.816 (0.108)
 & 138.086 (5.086) \\
 &  $\varepsilon_{ATE}$ (calib)
 & 11.060 (0.120)
 & 11.198 (0.106)
 & 11.299 (0.143)
 & 11.070 (0.123)
 & 11.430 (0.133) \\
 & $\Delta_{ECE}$
 & 0.011 (0.001)
 & 0.069 (0.002)
 & 0.053 (0.001)
 & 0.275 (0.006)
 & 0.206 (0.003) \\
 \midrule
TGP
 & $\varepsilon_{ATE}$ (plain)
 & 11.560 (0.650)
 & 11.965 (0.754)
 & 11.677 (0.614)
 & 12.246 (0.713)
 & 87.329 (5.716)\\
 & $\varepsilon_{ATE}$ (calib)
 & 10.320 (0.430)
 & 11.530 (0.633)
 & 10.519 (0.402)
 & 10.244 (0.398)
 & 9.070 (0.316) \\
 & $\Delta_{ECE}$
 & 0.022 (0.001)
 & 0.061 (0.002)
 & 0.070 (0.002)
 & 0.204 (0.007)
 & 0.267 (0.004) \\
\bottomrule
\end{tabular}

\label{table:apdx: gwas-comp}
\end{table}

% \begin{table}[H]
% % \vspace{-0.8cm}
% \caption{Calibrated naive bayes gives us competitive performance with lower computational resources as compared to logistic regression as we increase the total number of SNPs.}
% \hspace{0.1cm}

% \centering
% \small
% \begin{tabular}{lcccccc}
% \toprule
% Method & 	&	 100 SNPs &	500 SNPs &		1000 SNPs \\
% \midrule
% Naive	&	&	22.408 (5.752)	&	31.547 (3.624) &		21.349 (3.640) \\
% PCA		&	& 18.104 (3.641)&		28.115 (2.500)&		19.186 (3.758) \\
% FA	&	&	18.531 (3.286)&		28.03060 (2.5006)&		19.016 (3.852) \\
% LMM		&	&		17.575 (3.409) &	28.638 ( 2.499) &	19.908 (3.592) \\
% \midrule
%  Calibrated NB & & & & \\
%  IPTW & & 17.860 (3.715) & 31.304 (2.603) & \textbf{18.210 (1.705)} \\
%  AIPW & & \textbf{16.770 (2.764)} & \textbf{28.857 (2.780)} & 21.319 (3.704) \\
% 		Throughput (SNPs/sec) & &	34.9	 & 38.7	& 47.6 \\
%   \midrule
%   Plain NB & & & & \\
% IPTW & & 1287.8240 (382.751) &  1850.408 (425.153)  & 1455.992 (185.084) \\
%   AIPW & & 575.163 (112.141) &  353.710 (82.275) & 336.671 (63.148) \\
% 	Throughput (SNPs/sec) & 	& 93.2	&55.6	& 68.6 \\
%  \midrule
%  Calibrated LR & & & & \\
% IPTW & & 17.23742 (3.0538) & 28.963 (2.754) &  23.618 (3.832) \\
% AIPW & &  17.647 (3.208) & 29.489 (2.821) & 22.795 (4.249) \\
% Throughput (SNPs/sec) & & 	43.6	&33.9	& 19.5 \\
%   \midrule
%   Plain LR & & & & \\
% 	IPTW & & 19.297 (3.425)  & 29.309 (2.773)  & 23.525 (4.530) \\
%  AIPW & & 20.652 (3.286) & 30.038  (2.976) &  27.921 (4.713)\\
% 		Throughput (SNPs/sec) &	& 56.5	& 35.5 &	20.1 \\
% \bottomrule
% \end{tabular}
% \label{apdx:table:calib_nv_vs_lr}
% \end{table}



% \begin{table}[H]
% % \vspace{-0.8cm}
% \caption{Calibrated naive bayes gives us competitive performance with lower computational resources as compared to logistic regression as we increase the total number of SNPs.}
% \hspace{0.1cm}

% \centering
% \small
% \begin{tabular}{lcccccc}
% \toprule
% Method & 	&	 100 SNPs &		1000 SNPs \\
% \midrule
% Naive	&	&	22.408 (5.752)	&		21.349 (3.640) \\
% PCA		&	& 18.104 (3.641)&		19.186 (3.758) \\
% FA	&	&	18.531 (3.286)&			19.016 (3.852) \\
% LMM		&	&		17.575 (3.409) &	19.908 (3.592) \\
% \midrule
%  Calibrated NB & & & & \\
%  IPTW & & 17.860 (3.715) & \textbf{18.210 (1.705)} \\
%  AIPW & & \textbf{16.770 (2.764)} & 21.319 (3.704) \\
% 		Throughput (SNPs/sec) & &	34.9	 & 47.6 \\
%   \midrule
%   Plain NB & & & & \\
% IPTW & & 1287.8240 (382.751) & 1455.992 (185.084) \\
%   AIPW & & 575.163 (112.141) &  336.671 (63.148) \\
% 	Throughput (SNPs/sec) & 	& 93.2 & 68.6 \\
%  \midrule
%  Calibrated LR & & & & \\
% IPTW & & 17.23742 (3.0538) &  23.618 (3.832) \\
% AIPW & &  17.647 (3.208) &  22.795 (4.249) \\
% Throughput (SNPs/sec) & & 	43.6	& 19.5 \\
%   \midrule
%   Plain LR & & & & \\
% 	IPTW & & 19.297 (3.425)  &  23.525 (4.530) \\
%  AIPW & & 20.652 (3.286) &  27.921 (4.713)\\
% 		Throughput (SNPs/sec) &	& 56.5 &	20.1 \\
% \bottomrule
% \end{tabular}
% \label{apdx:table:calib_nv_vs_lr}
% \end{table}

