\section{RELATED WORK}\label{sec:related}
Isotonic regression~\citep{mizil2005predicting} and Platt scaling~\citep{Platt99probabilisticoutputs} are used to calibrate uncertainties over discrete outputs. This concept has been extended to regression calibration~\citep{kuleshov2018accurate} and online calibration~\citep{kuleshov2017estimating}. %nd structured prediction~\citep{kuleshov2015structured}. 
Calibrated uncertainties have been used to improve deep reinforcement learning~\citep{malik2019calibrated, pmlr-v162-kuleshov22a} and Bayesian optimization~\citep{deshpande2023calibrated}, among other applications. %natural language processing~\citep{kumar2019calibration}, %, Conformal prediction has been used to generate calibrated predictive sets and this was extended to work with distribution shifts~\citep{tibshirani2019conformal, Barber2022Conformal}. 
    
\citet{Lenis2018-so} and~\citet{Kang_2007} demonstrate the degradation in treatment effect estimation due to misspecified treatment and outcome models. Various modifications of propensity score weights and different notions of calibration have been proposed to reduce the bias in treatment effect estimation~\citep{Imai2014-wi, zhao2017covariate, ning2018robust, Van_Der_Laan2023, Sturmer2007-mf, crump2009dealing, li2018balancing, Xu2010-jv}. Appendix~\ref{apdx:comparison-with-related-work} compares our work with these approaches in more detail.

Compared to covariate balancing calibration~\citep{Imai2014-wi}, which modifies the underlying optimization procedure for obtaining balancing weights, our notion of calibration is simpler to implement and does not modify the optimization of the propensity score model. Unlike techniques such as propensity weight trimming~\citep{crump2009dealing}, our method does not introduce bias by discarding weights beyond a pre-selected threshold. \citet{Van_Der_Laan2023} and~\citet{yadlowsky2022calibrationerror} apply the following notion of calibration to heterogeneous treatment effect (HTE) estimation: the average HTE of units with a given predicted HTE is equal to the shared predicted value. The goal of causal isotonic regression~\citep{Van_Der_Laan2023} is to ensure more directly that the predicted HTE is reliable for different sub-groups of the population. Our work, on the other hand, calibrates the uncertainty estimates of the propensity score model that weighs the treated and control outcomes to achieve covariate balance. Although both calibration methods can be implemented using isotonic regression, the calibration guarantees are different. Our definition ensures that we avoid extreme propensity weights while balancing covariates and improve the error bounds on causal effect estimates. Our approach to calibration is applicable to HTE estimation (Appendix~\ref{apdx:additional-experiments}, Table~\ref{table:toy-expr-pehe}) and can be used with misspecified propensity models that produce extreme weights. Applying our method to calibrate propensity scores in HTE estimation could be an interesting way to mitigate extreme propensity weights while performing causal isotonic regression~\citep{Van_Der_Laan2023}.
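
For concreteness, the two notions can be written side by side. The notation below is introduced only for this illustration and is an assumption that may differ from the notation used elsewhere in the paper: $X$ denotes covariates, $T \in \{0, 1\}$ the treatment indicator, $Y(1)$ and $Y(0)$ the potential outcomes, $\hat{\tau}$ a hypothetical HTE predictor, and $\hat{e}$ a hypothetical propensity score model. The HTE calibration condition described above reads
\[
\mathrm{E}\left[\, Y(1) - Y(0) \mid \hat{\tau}(X) = \tau \,\right] = \tau \quad \mbox{for all } \tau \mbox{ in the range of } \hat{\tau},
\]
whereas the standard notion of probability calibration, applied to a propensity score model, would read
\[
\Pr\left( T = 1 \mid \hat{e}(X) = p \right) = p \quad \mbox{for all } p \in [0, 1].
\]
The first condition constrains the effect predictor, while the second constrains the treatment assignment probabilities used to form the weights, which is one way to see why the two guarantees differ.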

Uncertainty calibration can be combined independently with other methods, such as trimming~\citep{crump2009dealing, li2018balancing}, stabilized weights~\citep{Xu2010-jv}, and covariate balancing techniques~\citep{hainmueller_2012, Chan2016-lf, zhao2017covariate, zubizarrata2015jose, ning2018robust}, to improve the quality of ATE estimates. 
%Several techniques have been proposed to trim extreme propensity weights~\citep{crump2009dealing, li2018balancing}, but trimming can introduce bias while reducing variance of the estimates~\citep{li2018addressing}. Propensity score calibration reduces variance from extreme propensity weights without increasing the bias. %but choosing optimal thresholds is dependent on the problem and wrong thresholds can introduce bias~\citep{li2018addressing}. 
%Approaches that balance covariates during optimization to obtain propensity weights have demonstrated theoretical and empirical advantages in causal effect estimation~\citep{hainmueller_2012, Chan2016-lf, zhao2017covariate, zubizarrata2015jose, ning2018robust}, but choosing appropriate covariate balancing conditions requires substantial knowledge of the observational study~\citep{benmichael2021balancing}. 

%~\citet{Lin2011-tv} rigorously define propensity score-based techniques to correct for confounding in Genome-Wide Association Studies (GWASs). ~\citet{Zhao2009propensityscore, Zhao2012analyzinggenetic} propose techniques to balance both genetic and non-genetic covariates using propensity scores. % ~\citet{Zhao2018-ht} combine propensity scores with linear dimension reduction for improved efficiency. 
%Other techniques to correct for confounding in GWAS include Principal Components Analysis~\citep{price2006pca}, Genomic Control~\citep{Devlin1999-mr}, Stratification Scores~\citep{Epstein2007-tz} and Linear Mixed Models~\citep{lippert2011fast}. 