\section{INTRODUCTION}\label{sec:intro}


This paper studies the problem of inferring the causal effect of an intervention from observational data. Examples include estimating the effect of a treatment on a medical outcome or the effect of a genetic mutation on a phenotype. A key challenge in this setting is confounding: for example, if a treatment is given only to sick patients, it may paradoxically appear to cause worse outcomes~\citep{greenland1999confounding, VanderWeele2006-yf}.

% Causal inference from observational data is an important problem in many fields, such as healthcare and genomics. One of the major challenges in this problem is confounding, which arises when the observed covariates are not balanced between the treatment and control groups. 
Propensity score methods are a popular tool for correcting for confounding in observational data~\citep{Rosenbaum1983-rp, DAgostino1998-ex,imbens2000therole,  Lanza2013-ul}. %Lunceford2004-zu, VanderWeele2006-yf,These methods estimate the probability of receiving a treatment given observed covariates, and balance covariates based on this probability.
However, propensity score methods can become unreliable when their predictive model outputs incorrect treatment assignment probabilities~\citep{Kang_2007, smith2005matching, Lenis2018-so}. 
For example, when the propensity score model is overconfident (a known problem with neural network estimators~\citep{guo2017calibration}), predicted assignment probabilities can be too close to zero~\citep{tan2017regularized}, which inflates the inverse propensity weights and causes the estimated causal effects to blow up.
More generally, propensity score weighting stands to benefit from accurate uncertainty estimation~\citep{kallus2020deepmatch}. % in the treatment assignment model.
% Oftentimes these probabilities can lead to invalid treatment assignments, and rare events can cause propensity scores to blow up, especially when paired with neural network estimators.
% It is therefore important to study the uncertainty in propensity score estimates, particularly with respect to calibration. 
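
To make the blow-up concrete, consider as a minimal sketch the standard inverse propensity weighted estimator of the average treatment effect. The notation is assumed only for this illustration: $T_i \in \{0,1\}$ is the treatment, $Y_i$ the outcome, $X_i$ the covariates, and $Q(X_i)$ the predicted treatment probability for individual $i$.
\[
\hat{\tau}_{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{T_i Y_i}{Q(X_i)} - \frac{(1-T_i)\,Y_i}{1-Q(X_i)}\right).
\]
A treated individual with $Q(X_i) = 0.001$ receives weight $1/Q(X_i) = 1000$, so a single overconfident prediction can dominate the estimate.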

This work argues that propensity score methods can be improved by leveraging calibrated uncertainty estimation in treatment assignment models.
Intuitively, when a calibrated model outputs a treatment probability of 90\%, 90\% of the individuals who receive that prediction should be assigned to the treatment group~\citep{Platt99probabilisticoutputs, kuleshov2018accurate}.
We argue that calibration is a necessary condition for propensity score models to yield unbiased treatment effect estimates, and that it also addresses the aforementioned problem of model overconfidence.
% however, it is typically not maintained in existing propensity score models. 
%
% Calibration is important for several reasons. Incorrect probabilities can lead to invalid treatment assignments, and rare events can cause propensity scores to blow up, especially when using neural network estimators. This work establishes that calibration is a necessary condition for propensity scoring and identifies settings where recalibration provides strict improvements. 
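
One common way to state this property, sketched here with notation assumed only for illustration ($T \in \{0,1\}$ for the treatment, $X$ for the covariates, and $Q(X)$ for the predicted treatment probability), is to require
\[
P\bigl(T = 1 \mid Q(X) = p\bigr) = p \quad \mbox{for every value } p \mbox{ that } Q(X) \mbox{ takes.}
\]
The 90\% example corresponds to $p = 0.9$.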

Off-the-shelf propensity score models are typically uncalibrated~\citep{kallus2020deepmatch};
our work introduces algorithms that provably enforce uncertainty calibration in these models.  %provides  theoretical analysis, and demonstrates the usefulness of calibrated propensities on several tasks, including genome-wide association studies. We also show that calibration can reduce computational requirements when applied to simpler propensity score models.
Post-processing of propensity weights is often done via trimming~\citep{crump2009dealing, li2018balancing}, but this can introduce bias by eliminating information from propensity weights below a pre-selected trimming threshold~\citep{li2018addressing}. Propensity score calibration reduces variance from extreme propensity weights without introducing bias from trimming thresholds. %but choosing optimal thresholds is dependent on the problem and wrong thresholds can introduce bias~\citep{li2018addressing}. 
Approaches that balance covariates during optimization to obtain propensity weights have demonstrated theoretical and empirical advantages in causal effect estimation~\citep{hainmueller_2012, Chan2016-lf, zhao2017covariate, zubizarrata2015jose, ning2018robust}, but choosing appropriate covariate balancing conditions requires substantial knowledge of the observational study~\citep{benmichael2021balancing}. %Even though uncertainty calibration does not directly optimize covariate balance, 
Uncertainty calibration is simpler to implement: it can be combined with any base propensity model, and with methods such as trimming or covariate balancing, without changing the model optimization procedure.

In summary, this paper makes the following contributions: (1) we provide formal arguments that establish calibration as a necessary condition for unbiased treatment effect estimation, %using popular estimators based on inverse propensity scores, 
prove that calibration reduces variance by avoiding extreme propensity weights, and show improved error bounds on causal effect estimates when calibration is enforced; %of commonly used inverse propensity weighted and doubly robust estimators; explain the benefits of uncertainty calibration in propensity score models; 
(2) we propose simple algorithms that enforce calibration; (3) we provide theoretical guarantees on the calibration and regret of these algorithms; (4) we demonstrate the effectiveness of calibrated propensities on several tasks and speed up high-dimensional genome-wide association studies (GWASs) more than two-fold.
% proposes simple recalibration techniques for ensuring that the probabilistic output of a learned propensity score model is calibrated. We investigate the theoretical properties of a calibrated propensity score model and its role in unbiased treatment effect estimation. Our results demonstrate improved causal effect estimation with calibrated propensity scores in several settings, highlighting the importance of calibration in propensity score methods.
