\section{Related Works and Discussion}

\paragraph{Calibration.} Calibration has been widely studied in machine learning~\citep{naeini2014binary, guo2017calibration, kumar2019calibration} and in applications such as meteorology~\citep{murphy1973vector,degroot1983forecasters,gneiting2005weather}, fairness~\citep{johnson2018multicalibration}, and healthcare~\citep{jiang2012calibrating}. Many of these works focus on the in-distribution (ID) setting, where models are calibrated on the same distribution on which they are evaluated. \citet{ovadia2019uncertainty,jones2021selective} show that a model calibrated ID (e.g., via temperature scaling) still has poor uncertainty estimates out-of-distribution (OOD).
% ~\citet{} also show that model uncertainties can be quite unreliable out-of-distribution.
However, we show that despite having poor uncertainties on traditional metrics, calibrated models can be combined effectively to mitigate ID-OOD tradeoffs.
\citet{wald2021calibration} show that, in a linear setting, a model that is calibrated on many domains (more domains than features) is also calibrated (and invariant) on new domains. A key difference is that they require a large number of training domains, which may need to be annotated to ensure calibration across them, while we only require access to a single domain.

\paragraph{Ensembling.}
Ensembling models is a common way to improve accuracy; typically, the ensemble members are trained with different random seeds~\citep{lakshminarayanan2017simple} or augmentations~\citep{stickland2020diverse}.
In the setting where the ensemble members mostly differ by random seeds or augmentations, prior work has shown that calibrating the members of an ensemble does not help~\citep{wu2021ensemble,ovadia2019uncertainty}.
% Indeed, we find that calibration has minimal effect when we ensemble two standard, or two robust models, that are trained from different seeds.
However, when we combine two very different models (standard and robust), calibration leads to clear improvements.

\paragraph{Mitigating ID-OOD tradeoffs.} Tradeoffs between ID and OOD accuracy are widely studied, and prior work self-trains on large amounts of unlabeled data to mitigate such tradeoffs~\citep{raghunathan2020understanding, xie2021innout, khani2021removing}.
In contrast, our approach uses no extra unlabeled data and is a simple method where we just add up the model probabilities after a quick calibration step.
In concurrent and independent work, \citet{wortsman2021robust} show that there \emph{exists} a way to combine a CLIP zero-shot and fine-tuned model to get good ID and OOD accuracy. However, learning how to combine the models may require OOD data, which is not available. We show that the natural way to learn how to weight ensemble members, namely selecting the weights to optimize ID accuracy, does not get the best of both worlds.
In addition, their approach does not directly apply to settings where the standard and robust models have different architectures, such as In-N-Out~\citep{xie2021innout}.
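
As a minimal sketch of the combination step described above, assume that temperature scaling is the calibration step; the symbols $f_{\mathrm{std}}$, $f_{\mathrm{rob}}$, $T_{\mathrm{std}}$, and $T_{\mathrm{rob}}$ are illustrative notation introduced here for this sketch. Writing $f_{\mathrm{std}}(x)$ and $f_{\mathrm{rob}}(x)$ for the logits of the standard and robust models, and $T_{\mathrm{std}}, T_{\mathrm{rob}} > 0$ for temperatures fit on held-out ID data, adding up the calibrated probabilities would give the prediction
\[
\hat{y}(x) = \arg\max_{y} \left[ \mathrm{softmax}\left( f_{\mathrm{std}}(x) / T_{\mathrm{std}} \right)_y + \mathrm{softmax}\left( f_{\mathrm{rob}}(x) / T_{\mathrm{rob}} \right)_y \right].
\]
Under this reading, neither OOD data nor a learned mixing weight enters the prediction: the only fitted quantities are the two scalar temperatures.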


\paragraph{Conclusion and Future Work.} 
In this paper, we show that \calens{}, a simple method of calibrating a standard and robust model only on ID data and then ensembling them, can eliminate the tradeoff between in-distribution (ID) and out-of-distribution (OOD) accuracy on a wide range of natural shifts.
We hope that this leads to more widespread use and deployment of robustness interventions.

\calens{} were competitive with prior work that used self-training, despite being simpler and not using additional unlabeled data.
However, self-training may have advantages: we believe it may eliminate tradeoffs even in anticorrelated spurious settings. It would be interesting for future work to compare ensembling and self-training theoretically and to see whether their benefits are complementary.
Additionally, \calens{} require twice the compute of a single model (although, for fairness, we compared with an ensemble of standard or robust models), while self-training yields a single model.
One potential future direction is to see if~\calens{} can be distilled into a single model (without additional unlabeled data).
