% New abstract
We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy:
a robust classifier obtained via specialized techniques such as removing spurious features often has better OOD but worse ID accuracy than a standard classifier trained via empirical risk minimization (ERM).
In this paper, we find that \calens{}, which simply ensembles the standard and robust models after calibrating them on only ID data, outperforms the prior state of the art (based on self-training) on both ID and OOD accuracy.
On eleven natural distribution shift datasets, \calens{} obtains the best of both worlds: strong ID accuracy \emph{and} strong OOD accuracy.
We analyze this method in stylized settings and identify two important conditions for ensembles to perform well both ID and OOD: (1) the standard and robust models should be calibrated (on ID data, because OOD data is unavailable), and (2) the OOD data should have no anticorrelated spurious features.
\pl{I thought part of our thing is to calibrate, so why do we need them to be calibrated first?}
\ak{Oh they need to be calibrated, which is why we calibrate them}
% We check that as predicted by the theory, ensembles do not perform well when the OOD has anti-correlated spurious features.
% \ar{I think we should lead with the observation that calibrated ensembles outperforms prior state-of-the-art, and mention that we understand this more carefully and identify three conditions under which this method works}
% \tnote{I agree with Aditi. i think not many people would know that ensembles achieve the best of both worlds. They will find the word "explain" weird (in the phrase "we explain in stylized settings"). You perhaps need to use the word "predict" if you start with theory}
% \ak{Done!}

% We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy.
% A ``robust'' classifier obtained via specialized techniques like removing spurious features often has better OOD but worse ID accuracy compared to a ``standard'' classifier trained via vanilla ERM.
% \tnote{the acronym ERM is not defined (but I personally don't feel that you have to define all acronym )}
% Towards mitigating the tradeoff, this paper investigate the ID and OOD performance of ensembling the standard and robust models.
% We identify two important conditions for ensembles to perform well on both ID and OOD: (1) standard and robust models should be calibrated (on ID data, because OOD data is unavailable), (2) OOD has no anti-correlated spurious features.
% Under these conditions, we explain in stylized settings why ensembles can get the best of both worlds: outperforming the ID accuracy of the standard model, and the OOD accuracy of the robust model.
% We test out these intuitions on 13 standard datasets, spanning multiple modalities, types of shifts, and robustness interventions.
% We test out these intuitions on 13 standard datasets, spanning multiple modalities (vision, language, and time-series), types of shifts (geography, subpopulation, style, adversarial spurious, and label shifts), and robustness interventions.
% As predicted by our analysis, calibrated ensembles achieve the best of both worlds on all 10 natural distribution shifts, but do not perform as well on adversarially synthesized spurious shifts.
% Despite its simplicity, ID calibrated ensembles outperform prior state-of-the-art.
% , self-training (which requires additional unlabeled data)
% The calibration step is important---interestingly, a common approach of tuning the ensemble weights on ID data, does not do well OOD.
% \ak{Not sure if I should talk about tuned ensembles not doing as well}

% On thirteen distribution shift datasets, we find that simply ensembling the standard and robust models by averaging their predicted probabilities performs surprisingly well---instead of interpolating between the standard and robust models' accuracies, we usually match the ID accuracy of a standard model with only a small drop in OOD accuracy compared to the robust model.
% Interestingly, calibrating the standard and robust models using only ID data improves the OOD accuracy of the ensemble and eliminates the tradeoff, giving us the best of both ID and OOD accuracy over the original models.

% Our goal is to get the best of both worlds: the strong ID accuracy of the standard model and OOD accuracy of the robust model.
% We find that a simple approach of ensembling the standard and robust models, after calibrating on only ID data, outperforms prior state-of-the-art.
