\section{DISCUSSION AND CONCLUSIONS}
\label{sec:discussion}
%\sd{Add some conclusions on theoretical results, can modify this based on final structure of paper}
     The true treatment assignment mechanism in an observational study is rarely known, and misspecified propensity score or outcome models can bias treatment effect estimates~\citep{Kang_2007, Lenis2018-so}.
     We propose a simple technique for post-hoc calibration of the propensity score model. % and examined the theoretical properties of calibration for treatment effect estimation. 
     We show that calibration is a necessary condition for accurate treatment effect estimation and that calibrated uncertainties improve propensity score models. Empirically, our technique reduces bias in treatment effect estimates across a range of treatment assignment functions and base propensity score models. %As compared to calibration by optimizing the covariate balancing property~\citep{Imai2014-wi}, our procedure is simpler and does not require any modification to the training of the base propensity score model. 
     Propensity score models over high-dimensional, unstructured covariates such as images, text, and genomic sequences are harder to specify. We show that calibration improves treatment effect estimates for such covariates across a range of base models, including the widely used logistic regression. Calibrating simpler models such as Naive Bayes over high-dimensional covariates yields higher computational throughput while keeping the error in treatment effect estimation competitive.
     \paragraph{Limitations of Calibrated Propensities.} Calibration can ensure accurate causal effect estimates when the propensity score model $Q$ can discriminate between different treatments. For example, if the propensity model outputs the marginal treatment distribution, i.e., $Q(T|X) = P(T)$, then $Q$ is perfectly calibrated but cannot yield accurate treatment effect estimates. Requiring that $Q$ discriminate between different treatments is a strong condition, which we discuss further in Appendix~\ref{apdx:cal-improves-accuracy1}. When we use calibrated propensity scores for causal effect estimation, we also assume that the observed covariates contain information on all confounders. In the presence of unobserved confounders that cannot be recovered, calibrating the propensity scores will not help.
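     As an illustrative sketch of the marginal-propensity example above, assume a binary treatment $T \in \{0,1\}$ and the standard inverse propensity weighted (IPW) estimator; the symbols $\pi$, $n_1$, $n_0$, $Y_i$, and $\hat{\tau}_{\mathrm{IPW}}$ are introduced only for this example and may differ from the notation used elsewhere in the paper. The constant model $Q(T=1|X) = \pi$ with $\pi = P(T=1)$ is calibrated because its only output value satisfies
     \[
       P\big(T = 1 \mid Q(T=1|X) = \pi\big) = P(T = 1) = \pi .
     \]
     Plugging in the empirical marginal $\pi = n_1/n$, where $n_1$ and $n_0 = n - n_1$ count the treated and control units, the IPW estimator becomes
     \[
       \hat{\tau}_{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i Y_i}{\pi} - \frac{(1-T_i)\,Y_i}{1-\pi}\right] = \frac{1}{n_1}\sum_{i:\,T_i=1} Y_i - \frac{1}{n_0}\sum_{i:\,T_i=0} Y_i ,
     \]
     which is the unadjusted difference in group means and therefore remains biased whenever $X$ confounds treatment and outcome.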
     


% \paragraph{Limitations and Future Directions}
% We perform an empirical evaluation for observational studies with binary treatments, but our calibration procedure can be potentially applied to multi-valued and continuous treatments. We leave this as future work. Our GWAS experiments were performed on a range of standard simulation models, but it will be interesting to extend these experiments to include non-genetic covariates, a higher number of SNPs, and real-world genotype matrices. Additionally,  the calibration of outcome models is an exciting direction for future work.
\newpage