

\section{CALIBRATED PROPENSITY SCORES}

% \subsection{Uncertainty Estimation for Propensity Scoring}

% Propensity scoring methods rely on a model $Q(T|X)$ to approximate the true probability $P(T|X)$. 
% If $Q(T=t|X)$ accurately represents the probability of assigning treatment $t$ to an individual with covariates $X$, an IPTW estimator will correctly estimate the treatment effect.
%
We start with the observation
% Crucially, 
that a good propensity scoring model $Q(T|X)$ must not only correctly output the treatment assignment, but also accurately estimate predictive uncertainty. Specifically, the {\em probability} of the treatment assignment must be correct, not just the class assignment. 
While a Bayes optimal $Q$ will perfectly estimate uncertainty, suboptimal models will need to balance various aspects of predictive uncertainty, such as calibration and sharpness.
This raises the question: what predictive uncertainty estimates work best for causal effect estimation using propensity scoring?

% % This paper argues that ideal propensity scores must be calibrated. 
% While a Bayes optimal predictive model is perfectly calibrated and sharp, most models are not optimal, and we need to balance these two distinct properties.
% % This paper argues that ideal propensity scores must be calibrated.
% We provide intuitive and formal arguments that calibration is a particularly important and necessary condition for accurate propensity scoring, and then derive 
% % and derives
% calibration algorithms that improve IPTW estimators.
% % We formalize this intuition below, and we provide examples where and IPTW estimator fails when it is not calibrated.
% % \vk{todo: put small example from proof into the appendix (after 5/17 is okay)}
% % Fortunately, calibration is also a property that is enforceable; in the next section, we provide algorithms that ensure calibration in the next section and show that miscalibration error decreases as $O(1/\sqrt{n})$, where $n$ is the size of a small additional recalibration dataset. Additionally, we will see that calibration can be enforced in a post-hoc manner without impacting initial model performance, as measured by a proper loss function.

% This paper argues that ideal propensity scores must be calibrated. Intuitively, if the model predicts that the probability of assignment of a treatment is 80\%, then 80\% of these predictions should indeed receive the treatment. If the predicted number is larger or smaller,  the downstream IPTW estimator will either overcorrect or undercorrect for the biased treatment allocation. In other words, calibration is a {\em necessary condition} for a correct propensity scoring model.

% Are predictions from a propensity score model calibrated out of the box?
% When a predictive model is Bayes optimal, its forecasts are calibrated and sharp.
% However, most models are not optimal, and we need to balance these two distinct properties.
% Our paper provides argument why calibration is more important than sharpness.

% This paper demonstrates that the calibration-sharpness tradeoff significantly impacts downstream performance. In particular, we argue that it is much {\bf better to be calibrated than sharp}. We provide intuitive and formal arguments for this claim, and then derive calibration algorithms that improve decision-making performance.

\subsection{Calibration: A Necessary Condition for Propensity Scoring Models}
% \sd{Rewrite 'necessary condition' - since theorem 3.1 does not establish this for all outcome functions?}
This paper argues that calibration improves propensity-scoring methods. Intuitively, if the model $Q(T=1|X)$ predicts a treatment assignment probability of 80\%, then 80\% of the individuals who receive this prediction should in fact be treated. If the predicted probability is larger or smaller than this frequency, the downstream IPTW estimator will overcorrect or undercorrect for the biased treatment allocation; see below for a simple example.

In other words, calibration is a {\em necessary condition} for a correct propensity scoring model. We formalize this intuition below, and we provide examples in Appendix~\ref{apdx:calibration-necessary} where an IPTW estimator fails when it is not calibrated.
%\vk{todo: put small example from proof into the appendix (after 5/17 is okay)}

\begin{theorem}
 When $Q(T|X)$ is not calibrated, there exists an outcome function such that an IPTW estimator based on $Q$ yields an incorrect estimate of the true causal effect almost surely.
% For each uncalibrated model $Q(T|X)$, the set of data distributions $P$ for which an IPTW estimator yields incorrect  probabilities has measure one.
\end{theorem}
\begin{proof}[Example]
Consider $\mathcal{X} = \mathcal{T} = \mathcal{Y} = \{0,1\}$. Let $ P(T=1|X=0)=p_0,  P(T=1|X=1)=p_1$ and $P(X=1)=0.5$. Let us assume that $Q(T=1|X=0) = q_0$ and $Q(T=1|X=1)=q_1$. When $Q(T|X)$ is uncalibrated, $\exists i \in \{0,1\}, p_i \neq q_i$. 

If $p_1 \neq q_1$, we set $Y=X \oplus T$ (here $\oplus$ denotes logical `AND'), and the IPTW estimator based on $Q$ obtains $\tau' = \frac{0.5\, p_1}{q_1}$. Here, the true ATE is $\tau=0.5$, so $\tau' \neq \tau$.

If $p_0 \neq q_0$, we set $Y=\bar{X} \oplus \bar{T}$ ($\bar{V}$ denotes the logical negation of a binary variable $V$), and the IPTW estimator based on $Q$ obtains $\tau' =\frac{-0.5(1-p_0)}{1-q_0}$. Here, the true ATE is $\tau=-0.5$, so again $\tau' \neq \tau$.

Please note that we require the model $Q$ to be uncalibrated and not necessarily inconsistent. 
\end{proof}
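As a numerical illustration of the first case (the values are hypothetical and chosen here only to make the arithmetic concrete), take $p_1 = 0.5$ and $q_1 = 0.8$ with $q_0 = p_0$. Then
\[
\tau' = \frac{0.5 \cdot p_1}{q_1} = \frac{0.5 \cdot 0.5}{0.8} = 0.3125, \qquad \tau = 0.5,
\]
so the over-confident score $q_1 > p_1$ shrinks the estimated effect by the factor $p_1/q_1 = 0.625$.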
% Consider a toy binary setting where $\mathcal{X} = \mathcal{T} = \{0,1\}, \mathcal{Y} = \{0,1\}^2$.

% We set $Y = (X \oplus T, \bar{X} \oplus \bar{T}) $, $ P(T=1|X=0)=p_0,  P(T=1|X=1)=p_1$ and $P(X=1)=0.5$ such that $\oplus$ is logical `AND' and $\bar{V}$ denotes logical negation of binary variable $V$. We see that true ATE is $\tau=(0.5, -0.5)$. Let us assume that $Q(T=1|X=0) = q_0$ and $Q(T=1|X=1)=q_1$. Thus, with IPTW estimator based on $Q$, we estimate $\tau' = \mathbb{E} \bigg(\frac{TY}{Q(T=1|X)} - \frac{(1-T)Y}{1-Q(T=1|X)}\bigg) = (-\frac{0.5(1-p_0)}{1-q_0}, \frac{0.5.p_1}{q_1}).$ The treatment effect $\tau'=\tau$ only when $q_0=p_0$ and $q_1=p_1$, which is not true if $Q$ is not calibrated. 

Please refer to Appendix~\ref{apdx:calibration-necessary} for a full proof. Appendix~\ref{apdx:doubly-robust} also proves the following theorem for the AIPW estimator.  
\begin{theorem}
\label{thrm-aipw-calibration-necessary}
When the propensity model $Q(T|X)$ is not calibrated and the outcome model $f(X, T)$ is inaccurate for $X \in \{X: Q(T=1|X)=q\} \subseteq \mathcal{X}$ such that $P(T=1| Q(T=1|X)=q) \neq q$, then there exists a true outcome function such that the doubly robust AIPW estimator based on $Q$ and $f$ yields an incorrect estimate of the true causal effect almost surely.
\end{theorem}
%Theorem~\ref{apdx:thrm-calibration-necessary} in 
 Thus, for the AIPW estimator, calibration is a necessary condition when the outcome model is inaccurate.%\sd{resolve possible confusion over requirement that model be uncalibrated and not inconsistent} %in regions where the propensity score model is uncalibrated. 

% Fortunately, calibration is also a property that is enforceable---we provide algorithms that ensure calibration in the next section. 

% and show that calibration can be enforced in a post-hoc manner without impacting initial model performance, as measured by a proper loss function.
% and show that miscalibration error decreases as $O(1/\sqrt{n})$, where $n$ is the size of a small additional recalibration dataset. Additionally, calibration can be enforced in a post-hoc manner without impacting initial model performance, as measured by a proper loss function.


% \begin{enumerate}
%     \item First, we can try to intuitively explain why a model may fail if it's not calibrated. Maybe use a small table with a small example in it.
%     \item Then, we can state a formal theorem. It could say something along the lines of: for each uncalibrated model $q$, there exist data distributions $p$ where you are not calibrated (and almost all $p$ satisfy this).
% \end{enumerate}

\subsection{Calibrated Uncertainties Improve Propensity Scoring Models}

In addition to showing that calibration is a necessary condition, we identify settings in which calibration either is sufficient or prevents common failure modes of IPTW estimators. Specifically, we study two such regimes: (1) accurate but over-confident propensity scoring models (e.g., neural networks \citep{guo2017calibration}); (2) high-variance IPTW estimators that take as input numerically small propensity scores.

% \subsubsection{Bounding the Error of Causal Effect Estimation Using Proper Scores}
\subsubsection{Error Bound on Causal Effect Estimates}

Our first step for studying the role of calibration is to relate the error of an IPTW estimator to the difference between a model $Q(T|X)$ and the true $P(T|X)$. 
We define $\pi_{y,t}(Q) = \sum_x P(y|x,t)\frac{P(t|x)}{Q(t|x)}P(x)$ to be the estimated probability of $y$ given $t$ under a propensity score model $Q$.
It is not hard to show that the true $Y[t] := \mathbb{E}_X Y[X,t] = \mathbb{E}_X \mathbb{E}[Y|X, \mathrm{do}(T=t)]$ can be written as $\sum_{y} y \pi_{y,t}(P)$ (see Appendix~\ref{apdx:calibrated-uncertainties-improve-propensity}).
Similarly, the estimate of an IPTW estimator with propensity model $Q$ in the limit of infinite data tends to $\hat Y_Q[1] - \hat Y_Q[0]$, where $\hat Y_Q[t]:= \sum_{y} y \pi_{y,t}(Q)$. We may bound the expected L1 ATE error $|Y[1] - Y[0] - (\hat Y_Q[1] - \hat Y_Q[0])|$ by $\sum_t |Y[t] - \hat Y_Q[t]| \leq \sum_t \sum_y |y| \cdot |\pi_{y,t}(P) - \pi_{y,t}(Q)|$.

% We define the error $\hat \tau_t$ of an IPTW estimator for treatment $T=t$  as 
% $(\hat\tau_t(P) - \hat\tau_t(Q))^2, $
% where $\hat \tau_t(P) = \mathbb{E}_{X, Y \sim R_t} \frac{Y}{Q(T=t|X)}$ is the expected estimate of the causal effect using a propensity score model $Q$ and $R_t \propto P(Y=1 | X, T=t) P(T=t|X) P(X)$ is the distribution of $X, Y$ conditioned on $T=t$. By definition, the optimal estimate is $\hat\tau(P)$. \sd{causal effect is $\hat\tau_1(P) - \hat\tau_0(P)$?}

Our first lemma bounds the error $|\pi_{y,t}(P) - \pi_{y,t}(Q)|$ as a function of the difference between $Q(T|X)$ and the true $P(T|X)$. A bound on the ATE error follows as a simple corollary.
\begin{lemma}
The expected error $|\pi_{y,t}(P) - \pi_{y,t}(Q)|$ induced by an IPTW estimator with propensity score model $Q$ is bounded as
\begin{equation}
|\pi_{y,t}(P) - \pi_{y,t}(Q)| \leq \mathbb{E}_{X \sim R_{y,t}}[ \ell_\chi(P_t,Q_t)^\frac{1}{2}],    
\end{equation}
where $R_{y,t} \propto P(Y=y | X, T=t) P(X)$ is a data distribution and $\ell_\chi(P_t,Q_t)= \left( 1- \frac{P(T=t|X)}{Q(T=t|X)} \right)^2$ is the $\chi$-squared loss between the true propensity score and the model $Q$. 
\end{lemma}
% \sd{Are we bounding just $(\hat\tau_1(P) - \hat\tau_1(Q))^2$?  proof will work for $(\hat\tau_0(P) - \hat\tau_0(Q))^2$ too. Wouldn't the bound also depend on $\mathbb{E} (Y^2)$?}
\begin{proof}[Proof (Sketch)]
Note that
$
|\pi_{y,t}(P) - \pi_{y,t}(Q)|
 \leq \mathbb{E}_{X\sim R_{y,t}} \left| 1- \frac{P(T=t|X)}{Q(T=t|X)} \right| 
 = \mathbb{E}_{X\sim R_{y,t}} \ell_\chi(P_t,Q_t)^{1/2},
$
where the last step uses the definition of $\ell_\chi$.
\end{proof}
See Appendix~\ref{apdx:error-bound} for the full proof. 
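The first inequality in the sketch can be unpacked as follows. This is our reading of the definitions above, with $Z_{y,t} := \sum_x P(y|x,t)P(x) \leq 1$ introduced here as an assumed name for the normalizer of $R_{y,t}$. Since $\pi_{y,t}(P) = \sum_x P(y|x,t)P(x)$,
\[
\pi_{y,t}(P) - \pi_{y,t}(Q) = \sum_x P(y|x,t) P(x) \left(1 - \frac{P(t|x)}{Q(t|x)}\right),
\]
and taking absolute values inside the sum gives
\[
|\pi_{y,t}(P) - \pi_{y,t}(Q)| \leq Z_{y,t}\, \mathbb{E}_{X \sim R_{y,t}} \left| 1 - \frac{P(t|X)}{Q(t|X)} \right| \leq \mathbb{E}_{X\sim R_{y,t}} \left| 1 - \frac{P(t|X)}{Q(t|X)} \right|.
\]
For instance, in the binary example of the preceding subsection with $Y = X \oplus T$, $y = t = 1$, and the hypothetical values $p_1 = 0.5$, $q_1 = 0.8$, the left-hand side is $|0.5 - 0.3125| = 0.1875$, while the right-hand side is $|1 - 0.5/0.8| = 0.375$.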
\begin{corollary}
\label{corollary}
Let $|y| \leq K$ for all $y\in\mathcal{Y}$.
The error of an IPTW estimator with propensity score model $Q$ is bounded by $2|\mathcal{Y}| K \max_{y,t} \mathbb{E}_{R_{y,t}} \ell_\chi(P_t,Q_t)^\frac{1}{2}.$ 
%\sd{We pull either $K$ or $\max_{y,t} \mathbb{E}_{R_{y,t}} \ell_\chi(P,Q)^\frac{1}{2}$ out of summation $\sum_{y, t}$. If we pulled both, shouldn't we multiply the bound by $|\mathcal{Y}||\mathcal{T}|=2|\mathcal{Y}|$?}
% \begin{equation}
% E  = (\hat\tau(P) - \hat\tau(Q))^2  \leq \mathbb{E}_{X, Y \sim R}[ \ell_\chi(P,Q)],    
% \end{equation}
\end{corollary}
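The constant in the corollary can be traced by combining the bound $\sum_t \sum_y |y| \cdot |\pi_{y,t}(P) - \pi_{y,t}(Q)|$ on the ATE error with the lemma. The following is a sketch that assumes a binary treatment $t \in \{0,1\}$ and a finite outcome set $\mathcal{Y}$:
\[
\sum_{t \in \{0,1\}} \sum_{y \in \mathcal{Y}} |y| \cdot |\pi_{y,t}(P) - \pi_{y,t}(Q)|
\leq \sum_{t \in \{0,1\}} \sum_{y \in \mathcal{Y}} K \, \mathbb{E}_{R_{y,t}} \ell_\chi(P_t,Q_t)^{1/2}
\leq 2 |\mathcal{Y}| K \max_{y,t} \mathbb{E}_{R_{y,t}} \ell_\chi(P_t,Q_t)^{1/2}.
\]
The factor $2|\mathcal{Y}|$ counts the terms of the double sum, each of which is bounded by $K$ times the largest expected loss.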
%\sd{Reviewer points out: $\ell_\chi$(P, Y) is proper scoring rule where Y is observed data, $\ell_\chi$(P, Q) is not}
Note that $\ell_\chi$ is obtained from a proper scoring rule: it is small only if $Q$ correctly captures the probabilities in $P$. A model that accurately outputs the treatment assignment but does not output the correct probabilities will have a large $\ell_\chi$; conversely, when $Q=P$, the bound equals zero and the IPTW estimator is perfectly accurate.
%
To the best of our knowledge, this is the first bound that relates the accuracy of an IPTW estimator directly to the quality of uncertainties of the probabilistic model $Q$. Corollary~\ref{apdx:corollary} in Appendix~\ref{apdx:doubly-robust} obtains a similar upper bound on the error of the doubly robust AIPW estimator that is proportional to the chi-squared loss $\ell_\chi$. 

% \subsubsection{Calibration Reduces Variance of Inverse Probability Estimators}
\subsubsection{Calibration Reduces Variance of Estimators}

A common failure mode of IPTW estimators arises when the probabilities from a propensity scoring model $Q(T|X)$ are small or even equal to zero---division by $Q(T|X)$ then causes the IPTW estimator to take on very large values or be undefined. Furthermore, when $Q(T|X)$ is small, small changes in its value cause large changes in the IPTW estimator, which induces problematically high variance. % in the resulting estimates.
This failure mode also affects the doubly robust AIPW estimator, although it is more stable than the IPTW estimator. 

Here, we show that calibration can help mitigate this failure mode. If $Q$ is calibrated, then it cannot take on abnormally small values relative to $P$. Specifically, if $P(T=t|X) > \delta$ for some $0 < \delta < 1/2$, then any prediction from a calibrated estimate $Q$ of $P$ must exceed $\delta$ as well. In other words, division by small numbers cannot be a greater problem than in the true model. 

\begin{theorem}
\label{variance-reduction}
    Let $P$ be the data distribution, and suppose that $1 - \delta > P(T|X) > \delta$ for all $T, X$ and let $Q$ be a calibrated model relative to $P$. Then $1 - \delta > Q(T|X) > \delta$ for all $T, X$ as well.
\end{theorem}
\begin{proof}[Proof (Sketch)]
The proof is by contradiction. Suppose $Q(T=1|x) = q$ for some $x$ and $q < \delta$. Because $Q$ is calibrated, among the inputs on which we predict $q$ we have $P(T=1|Q(T=1|X) = q) = q <\delta$. This is impossible: $P(T=1|Q(T=1|X) = q)$ is an average of $P(T=1|x)$ over those inputs, and $P(T=1|x) > \delta$ for every $x$. 

See Appendix~\ref{apdx:variance-reduction} for the full proof. 
\end{proof}
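To make the consequence for the inverse-probability weights explicit (the numbers below are hypothetical and serve only as an illustration): the IPTW estimator weights each unit by $1/Q(T|X)$, so under the conditions of Theorem~\ref{variance-reduction},
\[
\frac{1}{1-\delta} < \frac{1}{Q(T|X)} < \frac{1}{\delta}.
\]
With $\delta = 0.05$, every weight produced by a calibrated model is below $20$, whereas an uncalibrated model that outputs $Q(T=1|x) = 0.001$ for a treated unit assigns it a weight of $1000$.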




% Idea: argue that there are setting where a calibrated model improves the error bound over an uncalibrated model. For example, if the model separates classes sufficiently well.
% \subsubsection{Calibration Improves Error Bounds on Causal Effect Estimate} 
\subsubsection{Calibration Improves Error Bounds} 


We show that calibration strictly improves our $\ell_\chi$ bound on the IPTW error.
\begin{theorem}
\label{thrm:error-bound-lx}
    Let $\ell_1$ be the bound from Corollary~\ref{corollary} on the expected error of an IPTW estimator with an uncalibrated propensity model $Q_1$, and let $\ell_2$ be the bound for $Q_2$, the recalibrated version of $Q_1$ with $\ell_\chi^{1/2}$ as the choice of loss $L$ to train the recalibrator. Then as the size of the calibration set $n \to \infty$ we have $\ell_2 \leq \ell_1$ with equality iff $Q_1 = Q_2$.
\end{theorem}
\vspace{-4mm}
% \sd{The proof refers to contents from subsequent section; can move it to a different location}
\begin{proof}[Proof (Sketch)]
    The part of $\ell_1, \ell_2$ that depends on $Q \in \{Q_1, Q_2\}$ is $L(Q, T) = \mathbb{E}_X \mathbb{E}_{T|X} \ell_\chi(Q(T=1|X), T)^{1/2}$.  In Section \ref{sec:algorithms}, we show that when we perform recalibration, 
    % by minimizing the loss function $L(Q, T)$, 
    it follows that $L(Q_2, T) = L(R \circ Q_1, T) \leq L(Q_1, T) + o(1)$ for a recalibrator $R$. As $n \to \infty$, $R \to B$, where $B$ is a Bayes optimal recalibrator. If $Q_1 \neq Q_2$, then $L(Q_2, T) \neq L(Q_1, T)$ because $L$ is strictly proper. Conversely, when $Q_1=Q_2$ clearly $\ell_1 = \ell_2$. Hence, the claim follows.
\end{proof}
Please refer to Appendix~\ref{apdx:iptw-error-bound} for a complete proof. Theorem~\ref{apdx:thrm:error-bound-lx} in Appendix~\ref{apdx:doubly-robust} proves a similar result for the AIPW estimator when the outcome model is inaccurate.
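
As an illustration in the binary setting $\mathcal{X} = \mathcal{T} = \{0,1\}$, consider the following sketch. The values are hypothetical, and we assume the infinite-data limit in which the recalibrator equals $B(q) = P(T=1 \mid Q_1(T=1|X) = q)$. Let $P(T=1|X=0) = 0.4$, $P(T=1|X=1) = 0.5$, $Q_1(T=1|X=0) = 0.2$ and $Q_1(T=1|X=1) = 0.8$. Because $Q_1$ takes distinct values on the two covariate values, $B(0.2) = 0.4$ and $B(0.8) = 0.5$, so $Q_2 = B \circ Q_1 = P$. For $t = 1$, the loss terms under $Q_1$ are
\[
\left| 1 - \frac{0.4}{0.2} \right| = 1 \quad (X=0), \qquad \left| 1 - \frac{0.5}{0.8} \right| = 0.375 \quad (X=1),
\]
whereas under $Q_2$ both terms equal $|1 - 1| = 0$. In this example the recalibrated bound is therefore no larger than the original one, which is consistent with $\ell_2 \leq \ell_1$.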
% \subsubsection{Calibration Improves the Accuracy of Causal Effect Estimation}
\subsubsection{Calibration and Accurate Causal Effect Estimation}
\label{apdx:cal-improves-accuracy}
%Calibration by itself is not sufficient to correctly estimate treatment effects. %For example, consider defining $Q(T|X)$ as the marginal $P(T)$: this $Q$ is calibrated, but cannot accurately estimate treatment effects.
If the model $Q$ is accurate enough to discriminate between different treatments (as might be the case with a powerful neural network), then calibration can ensure accurate IPTW estimates. This is a strong condition in practice. Please refer to Appendix~\ref{apdx:cal-improves-accuracy1} for a detailed theoretical analysis.  

% Our next section defines another
Below, we also show that a post-hoc recalibrated model $Q'$ has vanishing regret $\ell(Q',Q)$ with respect to a base model $Q$ and a proper loss $\ell$ (including $\ell_\chi$ used in our calibration bound).
