\section{Introduction}
\label{sec:intro}

% Outline:
% - Our problem is very important
%		Doing well OOD is important
% 		Robustness interventions improve OOD, but cause a drop in ID
% - Starting point: how to combine the standard & robust model ID
% 		If calibrated -> just add up their logits
% 		If don't calibrate, overparam can mess you up
% - How well do calibrated ensembles do OOD?
%		OOD confidences are bad
%		Look at stylized settings to build intuitions for when the method works
%		ID typically contains domain-specific features that are suppressed or absent OOD (example)
%		Calibrated ensembles do at least as well as the robust model here---intuitively they use the domain-specific feature when present, and use domain-general features otherwise
%		However, when spurious features are anti-correlated OOD, or we have imbalanced datasets with a large label shift, ensembles do poorly
% - Run experiments on X standard OOD datasets: calibrated ensembles are a strong method, and our intuitions check out
%		On the 11 `natural'/random distribution shift datasets (e.g. geographical shift, style shift, subpopulation shift), calibrated ensembles are BOB
%		We selected most of these datasets because they were used by prior work on mitigating accuracy tradeoffs
%		Those works proposed methods specialized for their datasets (e.g., that use large amounts of unlabeled data, or a robust fine-tuning method)
%		Calibrated ensembles achieve competitive results
%		As suggested by our intuitions, doesn't work in adversarial cases---gets accuracy in between the standard and robust models.
%		Spanning multiple modalities (vision, language, time-series), types of shifts (geographical shifts, subpopulation shifts, style shifts, worst-group, label shift), robustness interventions (pretraining, lightweight fine-tuning, language prompting, group distributionally robust optimization)
% 		Many datasets (geography shift in satellite remote sensing, subpopulation shifts, style shifts), fall in the missing features setting, and here ensembling gets the best of both the standard and robust model
% 		
%		Many of these settings have been used in prior work on mitigating tradeoffs under distribution shift
%		Prior works typically tailor a method for the task, for example self-training when a large amount of unlabeled data is available, or using a more robust fine-tuning algorithm when adapting a pretrained model
%		Interestingly, ensembles achieve comparable performance to these tailored methods
% - While the method is simple and intuitive, we found three things surprising:
% 		In a wide variety of natural distribution shifts, calibrating only on ID data and then ensembling get the best of both the standard and robust models, ID and OOD, instead of just interpolating between the standard and robust model accuracies. This is even though the models are poorly calibrated OOD, and the ensemble does not simply rely on the robust model OOD.
%		Ensembles perform as well as tailored methods such as self-training or robust fine-tuning, which require additional data and only work in certain situations
%		Tuning the ensemble weights on ID data does not work so well, but calibrated ensembles do. Some recent works (e.g. [1], [2]) suggest that ID and OOD accuracies can often be correlated, so a natural approach is to select the weight for the standard and robust model that maximizes ID validation accuracy. However, this often does not learn good weights (see Table 4) because it assigns too high a weight to the standard model, and performs poorly OOD.

Machine learning models suffer large drops in accuracy under distribution shift, where the test distribution differs from the training distribution. \ar{Sentence can be simplified - ML models are less accurate in the presence of distribution shift}As ML systems are widely deployed, it is important to train models that achieve good accuracy on unforeseen, out-of-distribution (OOD) examples.\ar{Should perhaps define what OOD is i.e. different from training distribution, in the same sentence that ``OOD'' appears in quotes} \tnote{+1, and no need to have quotes for out-of-distribution}
For example, models trained on medical data from a few hospitals should \pl{meaning of 'should' is unclear - it might be impossible; I think it's more natural to write this all under failures to generalize to OOD rather than 'should', because it's sticking more to the facts} work well when deployed broadly~\citep{zech2018radio, albadawy2018tumor}. Similarly, when predicting poverty from satellite imagery, models trained on data from a few countries should work well on all countries, particularly those where labels are scarce due to resource constraints~\citep{jean2016combining}. There has been substantial research interest in tackling this robustness problem in various settings, including robustness to spurious correlations~\citep{heinze2017conditional, sagawa2020group}, domain generalization~\citep{arjovsky2019invariant, sun2016deep}, and demographic shifts~\citep{hashimoto2018repeated, duchi2019distributionally}.

\begin{figure*}[th]
    \centering
    \includegraphics[width=\textwidth]{figures/executive_summary}
    \caption{
      In many settings, we have a standard model that performs better in-distribution, and a robust model that performs better out-of-distribution.
      Across \numnat{} natural distribution shifts, ID-calibrated ensembles get the best of both worlds: the strong ID accuracy of the standard model and OOD accuracy of the robust model.
      We analyze their strengths and limitations in Section~\ref{sec:analysis}. As predicted by our analysis, ID-calibrated ensembles do not perform as well on adversarially synthesized shifts with ``anticorrelated'' spurious features.
      We show full experimental results and ablations in Section~\ref{sec:experiments}.
    }
    \label{fig:calibration-figs}
\end{figure*}
\pl{I think this is too strong - the best robustness interventions these days - pre-training - improve both}
\ak{Agreed that it's too strong, so I've softened. But note that pretraining typically does have the same tradeoffs, e.g., in Lisa's prefix tuning, or LP-FT. We aren't comparing whether pretraining improves over no pretraining. We're comparing the best method for ID vs. the best method for OOD, so that'd be pretraining with LP vs. pretraining with FT. And we include such experiments in this paper!}
Across many of these settings, an unfortunate tradeoff arises: robustness interventions, such as removing spurious features or lightweight fine-tuning, typically improve the OOD accuracy but cause a drop in the in-distribution (ID) accuracy on new test points from the original distribution. 
This tradeoff is a major hurdle to adopting the many proposed methods that aim to improve OOD accuracy. In practice, most inputs are likely to be ID, so it is unsatisfactory to use a robust model that has high OOD accuracy but is less accurate on this ID majority. On the other hand, standard models (trained without robustness interventions) can fail under even small shifts, so it can be dangerous to use a standard model even if OOD points are rare. In this work, we ask: \emph{is there a general strategy to harness the strengths of both the standard and robust models to achieve high accuracy both ID and OOD, without using OOD data?}
\ar{I wonder if we should also say something about whether we even expect to mitigate this tradeoff at all (e.g., the robust features stuff in Tsipras et al and many other papers says it's fundamental) ... and also it seems like we are obviously missing the self-training references.. perhaps we can kill two birds by citing self-training works that show that it is possible to mitigate the tradeoff but they require a large amount of unlabeled data. And then add the qualifier ``without any additional data from the target domain'' in the question? It's certainly less clean than what you have, so think about it}
\tnote{I think a middle option is to cite Tsipras et al to demonstrate that best of both worlds should not be taken for granted, but not mention the self-training works to make the narrative simpler}

We find that \calens{}, a simple approach of first calibrating the standard and robust models on only ID data and then ensembling them, outperforms prior state-of-the-art both ID and OOD.
As illustrated in Figure~\ref{fig:calibration-figs}, across \numnat{} natural distribution shift datasets (e.g., geographical shift, style shift, subpopulation shift), \calens{} get the \emph{best of both worlds}: the strong ID accuracy of the standard model and the OOD accuracy of the robust model.
Averaged across these datasets, \calens{} achieve an ID accuracy of \calaccidnatural{}\% (vs. \stdaccidnatural{}\% for the standard model and \robaccidnatural{}\% for the robust model) and OOD accuracy of \calaccoodnatural{}\% (vs. \stdaccoodnatural{}\% for the standard model and \robaccoodnatural{}\% for the robust model).

We then analyze when and why ID-calibrated ensembles can get the best of the standard and robust models, under a simplifying assumption that these models provide different and independent signals for the label.
\pl{if we say independent, don't need to say they're different?} \pl{why would we expect the signals to be independent? I wouldn't expect them to be}
\ak{Yeah, they probably aren't (although works such as simplicity bias could motivate this---e.g., ERM often exclusively uses spurious features, robust methods project out spurious features and use others). I think it would be good to analyze more general settings in the future!}
If the standard and robust models are \emph{calibrated} ID, the ensembling strategy for the best ID performance is to simply add \pl{doesn't always type check, say combine?} the predictions of the two models (Proposition~\ref{prop:calibration-ensemble-optimal}).
By the same idea, if the standard and robust models were also calibrated OOD, ensembling would achieve the optimal OOD accuracy.
However, since we only have ID training data, models can only be calibrated ID, and ID calibration is not sufficient for OOD calibration~\citep{ovadia2019uncertainty}.
\pl{this feels a bit complicating the story; do we need to talk about calibrated OOD in the intro?}
\ak{I'm thinking it might seem trivial if the models are calibrated OOD? Because we already said if they're calibrated then they are optimal.}
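To make the intuition behind adding predictions concrete, consider the following sketch, which is an illustration under stated assumptions rather than the formal statement of Proposition~\ref{prop:calibration-ensemble-optimal}. Let $y \in \{0, 1\}$ be a binary label, and let $z_{\mathrm{std}}$ and $z_{\mathrm{rob}}$ be hypothetical names for the signals used by the standard and robust models. Assume that the two signals are conditionally independent given $y$, that each model outputs the true posterior given its own signal (a stronger condition than calibration), and that all probabilities are taken under the ID distribution. Bayes' rule then gives
\[
  \log \frac{P(y = 1 \mid z_{\mathrm{std}}, z_{\mathrm{rob}})}{P(y = 0 \mid z_{\mathrm{std}}, z_{\mathrm{rob}})}
  = \log \frac{P(y = 1 \mid z_{\mathrm{std}})}{P(y = 0 \mid z_{\mathrm{std}})}
  + \log \frac{P(y = 1 \mid z_{\mathrm{rob}})}{P(y = 0 \mid z_{\mathrm{rob}})}
  - \log \frac{P(y = 1)}{P(y = 0)}.
\]
When the two classes are balanced, the last term vanishes, so under these assumptions the combined log-odds are the sum of the two models' log-odds. A model that is overconfident or underconfident contributes log-odds of the wrong magnitude to this sum, which is one way to see why calibrating before ensembling can matter.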

When can calibrated ensembles provide benefits even without OOD calibration?
In many natural distribution shifts, standard models pick up on predictive signals in the training data that are absent or suppressed under distribution shift---in these cases, we show that \calens{} obtain the best of both the standard and robust models OOD.
However, when spurious features become anticorrelated OOD (as is common when the distribution shift is adversarially synthesized), we show that the ensemble's OOD accuracy lies between those of the standard and robust models.
We empirically validate this on three adversarially synthesized shifts~\citep{sagawa2020group,jones2021selective} where the spurious signals are anticorrelated OOD.

Finally, we compare \calens{} to several alternative ensembling strategies (for example, tuning the weights of the ensemble on ID validation data) and find that they do not work as well as \calens{}.
\pl{other ensembles don't calibrate? maybe make that clearer?}
\ak{doesn't make a difference for tuned ensembles (formally depends on the variant, whether we do logits or probs, but in practice doesn't matter)}

% adversarially synthesized shifts where the spurious signals are anti-correlated OOD, matching our conceptual analysis in Section~\ref{sec:analysis}. 

%% use robust models that perform worse on the vast majority of inputs? On the other han

%% Lower ID accuracy is a key obstacle to deploying robust models: many companies would not deploy a more robust model if it hurts their profits or a large majority of their users.
% For example, it can be difficult to deploy a medical system that can accurately diagnose a rare type of cancer in certain sections of the population, but is less accurate at diagnosing more common but equally deadly lung cancer.




% As a starting point, we examine how to combine a standard and robust model under a simplifying assumption that these models provide different and independent signals for the label.
% \ar{I'd cut the next two lines}
% This captures the intuition that standard models could leverage non-robust features of the input that are spuriously correlated with the label ID but not \ar{the correlation does not hold} OOD.
% On the other hand, robust models rely more on robust features that are informative both ID and OOD.
% \ar{Under this setting (if you choose to retain the description of the setting),} If the two models are \emph{calibrated} ID, the ensembling strategy for the best ID performance is to simply add the predictions of the two models (Proposition~\ref{prop:calibration-ensemble-optimal}). However, in practice, deep networks can be miscalibrated because they can fit the data perfectly and drive the training loss to 0 by being very overconfident~\citep{mukhoti2020calibrating,bai2021dont}. We show that such miscalibration can affect the performance of the ensemble. Hence, our first takeaway is that models should be calibrated before ensembling and \calens{} get the best of both standard and robust models ID. 

% By the same idea above, if the standard and robust models were also calibrated OOD, ensembling would achieve the optimal OOD accuracy. However, since we only have ID training data, models can only be calibrated ID and ID calibration is not sufficient for OOD calibration~\citep{ovadia2019uncertainty}.
% \ar{Can have slightly better flag-planting here... suggested rephrase is ``Can calibrated ensembles provide benefits even without OOD calibration? While this is not true in general, we uncover two conditions where this is true, by analyzing a variety of stylized settings''}
% We explore when ensembling ID calibrated models can improve OOD performance by analyzing a variety of stylized settings.
% In many natural distribution shifts, standard models pick up on predictive signals in the training data that are absent or suppressed under distribution shift---in these cases, we show that \calens{} obtain the best of both standard and robust models OOD.
% % We show that if spurious features are ``missing'' or ``suppressed'' OOD, then calibrated ensembles obtain the best of both standard and robust models.
% \tnote{I think many readers won't know what suppressed sprious means}
% % Such settings are ubiquitous in several natural distribution shifts where standard models pick up on predictive signals in the training data that are absent or suppressed under distribution shift.
% However, when spurious features become anti-correlated OOD (as is common when the distribution shift is adversarially synthesized), we see that the ensemble performance is in between the standard and robust models.
% % \ar{Can ensemble ever do worse than both?} \ak{Not in our experiments, but yeah if the dataset is imbalanced then just adding the logits can do worse than both models in some theoretical settings}
% % or when there is a large label shift,

% \ar{This sentence doesn't flow, and definitely cannot be the start of the paragraph, i.e. we mention a bunch of conditions but it's not clear that they are to be satisfied by a variety of shifts} Based on the analysis of our stylized setting, \calens{} should achieve the best of both worlds on a variety of \emph{natural distribution shifts}. We test this by experimenting on 13 standard OOD datasets spanning multiple modalities (vision, language, time-series), types of shifts (geographical shifts, subpopulation shifts, style shifts, adversarial spurious), and robustness interventions (pretraining, lightweight fine-tuning, language prompting, group distributionally robust optimization).
% On the 10 natural distribution shift datasets (e.g. geographical shift, style shift, subpopulation shift), \calens{} get the best of both the standard and robust models, achieving an average ID accuracy of \calaccidnatural{}\% (vs. \stdaccidnatural{}\% for the standard model and \robaccidnatural{}\% for the robust model) and OOD accuracy of \calaccoodnatural{}\% (vs. \stdaccoodnatural{}\% for the standard model and \robaccoodnatural{}\% for the robust model).
% Calibrated ensembles also outperform ensembles of standard models, and ensembles of robust models.
% \emph{Surprisingly, without any OOD data, this simple method of \calens{} can match prior state-of-the-art approaches based on self-training (which use large amounts of unlabeled OOD data)}.
% % \ar{If you are comparing to LP-then-FT, I suggest removing it...it's new work anyways and seems a bit clunky to combine}
% % \ak{Sounds good!}
% %% We selected most of these datasets because they were used by prior work on mitigating accuracy tradeoffs.

% Next, we examine the distribution shifts where \calens{} do not mitigate the ID-OOD tradeoff. We find that this is because of adversarially synthesized shifts where the spurious signals are anti-correlated OOD, matching our conceptual analysis in Section~\ref{sec:analysis}. Finally, we compare to a number of other alternate ensembling strategies and find that they do not work as well as \calens{}.
% \ak{Maybe describe tuned ensembles briefly, since that's a pretty common thing to do in ensembling}
% % While our general approach of ensembling is natural and intuitive, the specific method of adding the logits\ak{We actually add probabilities in our final method... That's what many ensembling papers do like Balaji's and Dmitris, but isn't the same as our theory. Adding probabilities does do a fair bit better than adding logits, which probably means we don't fully understand what's going on} of calibrated ensembles as motivated by our conceptual analysis, seems key to mitigating tradeoffs.  

\ak{We can keep this for now, but may need to cut if we need space with the new intuitions and larger tables.}
To summarize, our main contributions are:
\begin{enumerate}
  \item We revisit the classic idea of ensembling and propose a simple, general, and effective method (\calens{}) to mitigate ID-OOD accuracy tradeoffs. This method outperforms prior approaches based on self-training, despite not using any additional unlabeled data.
% and other specialized approaches that only work for specific kinds of shifts and tradeoffs. 
\item We prove that ensembles of calibrated models are optimal when the models provide independent signals about the label. However, models can only be calibrated ID, where we have training data, and ID calibration does not imply OOD calibration. In simple and stylized settings, we identify conditions under which \calens{} achieve the best of the standard and robust models in terms of OOD performance. We validate these insights experimentally and find that \calens{} eliminate tradeoffs under a variety of natural distribution shifts, but can fail under adversarially synthesized shifts.
  \pl{this is a bit long for the contributions section, which is usually succinct summaries}
\end{enumerate}
\pl{the contributions should be much shorter and punchier like a summary;
some of what you have here, especially point 2, could be moved into the main text
}

\ar{Overall, the intro looks good to me. Similarly to the comment on the abstract, I wonder if we should lead with the experimental result that CEs match or beat self-training. I feel the method/analysis is not that exciting on its own, it's exciting because it works so well. So starting with that observation seems more compelling to me, though it breaks the usual theory -> experiments flow of a standard paper}
%% While the method is simple and intuitive, we found three things surprising:
%% \begin{enumerate}
%% \item In a wide variety of natural distribution shifts, calibrating only on ID data and then ensembling get the best of both the standard and robust models, ID and OOD. The ensemble interpolates between the two models predictions, but the accuracy is not in between the standard and robust model accuracies. This is even though the models are poorly calibrated OOD, and the ensemble does not simply rely on the robust model OOD.
%% \item This simple method can be competitive with tailored methods such as self-training or robust fine-tuning, which require additional data and only work in certain situations.
%% \item Tuning the ensemble weights on ID data does not work so well, but calibrated ensembles do. Some recent works (e.g. [1], [2]) suggest that ID and OOD accuracies can often be correlated, so a natural approach is to select the weight for the standard and robust model that maximizes ID validation accuracy. However, this often does not learn good weights (see Table 4) because it assigns too high a weight to the standard model, and performs poorly OOD.
%% \end{enumerate}


% We consider four benchmark datasets (DomainNet, CIFAR $\to$ STL, ImageNet $\to$ ImageNet-R, and BREEDS-Entity-30) and two real world satellite remote sensing datasets (Landcover and Cropland), that have been used in prior work on robustness. Our work spans different types of robustness interventions (projecting out spurious correlations, zero-shot language prompting, freezing pretrained features), data modalities (image and time series data), and model architectures (vision transformers, deep convolutional networks, time series convolution).
% % We focus on robustness to natural distribution shifts [CITE papers] and not adversarial examples.
% Averaged across these datasets, robustness interventions increase OOD accuracy from 65\% to 77\%, but decrease ID accuracy from 88\% to 85\%.



% We first explore the natural strategy of ensembling the standard and robust models to combine their strengths. Concretely, we add the probabilities of each model to obtain a prediction with the hope that when the two models conflict, the more confident model (with larger probability) dictates the final prediction. We find that this surprisingly simple baseline already performs quite well: on average across all our datasets, this closes $90\%$ of the gap between the OOD accuracies of the standard and robust models, while outperforming both models ID. In other words, this simple baseline improves the OOD accuracy over standard models without hurting ID accuracy, unlike previous robustness interventions. However, vanilla ensembling still leaves a gap as it underperforms the robust model OOD.
% %% \emph{Can we get high-accuracy both in-distribution and out-of-distribution?}
% %% If we have large amounts of unlabeled data, prior work uses self-training to get the best of both worlds, a model with strong ID and OOD accuracy.
% %% Besides involving large amounts of computation, many datasets do not have additional unlabeled data for self-training.

% %% An alternative idea is to ensemble the standard and robust model---for example by adding the model probabilities [CITES] or logits before making a prediction.
% %% The hope is that each model is more confident on examples it gets correct.
% %% We find that this does fairly well, on average closing over 80\% of the gap between the standard and robust models OOD, and outperforming both methods ID.

% % Intuitively, the ensemble upweights the standard model (which does better ID), and downwe
% We find that simply calibrating both models ID (adjusting their predicted confidence to match their accuracy, on \emph{in-distribution} data) before ensembling them closes this gap.
% \emph{Calibrated ensembles get an average accuracy of 89.3\% ID and 77.9\% OOD, and outperform both the standard and robust model, ID and OOD}. The other method in the literature to alleviate robustness induced tradeoffs is self-training that uses large amount of unlabeled data~\citep{raghunathan2020understanding, xie2021innout, khani2021removing}. On the two remote-sensing datasets with additional unlabeled data, we find that calibrated ensembles match self-training on these datasets without requiring any unlabeled data. This shows that calibrated ensembles, though conceptually simple, can be highly effective in mitigating tradeoffs. 
% %% or (since we use both models symmetrically) knowledge of which model is better on which domain.
% As a sanity check, we find that the method also works when there is no tradeoff: even when the standard model (or robust model) dominates the robust model (or standard model) both ID and OOD, the calibrated ensemble accuracy matches up with the better model in both domains.

% While our method is intuitive in a way, it is also intriguing that it works so well because ensembling seems to rely on good uncertainty estimates while it is common wisdom that uncertainty estimates of deep networks are unreliable out-of-distribution~\citep{ovadia2019uncertainty}. Furthermore, it has been shown that calibration in-domain does not fix the issue of poor uncertainty estimates out-of-distribution. Indeed, on the six datasets we test on, the models fare poorly on standard uncertainty metrics OOD, even after calibrating ID. The expected calibration error (which roughly measures the difference between the model's confidence and accuracy) of the standard model across all datasets is 11\%.
% A partial answer might be that even if the models have high calibration error on average, what determines the quality of ensembling is the relative confidence between the standard and robust model on a particular data point. For example, on the remote sensing dataset (Landcover), the standard model is on average 6\% more confident in its OOD predictions than the robust model, even though the standard model is less accurate OOD---which seems bad. But at the granularity of individual points, we find the more confident model is more likely to be accurate, enabling calibrated ensembles to achieve higher OOD accuracy than both standard and robust models. While most prior work on confidence estimates has focused on a single model at a population level, our results suggest that examining individual data points and relative confidence of different models is an exciting line of future work.

% \paragraph{Comparison with regular ensembling.} Ensembling is an extremely common trick in ML to boost performance. However, when using ensembles, it is typical to obtain each ensemble member by running the same training process with a different random seed, which controls the stochastic aspects of the training algorithm such as initialization, the ordering of the training samples in a batch, and which features are dropped when using dropout. The idea behind ensembling is that with different random seeds, the algorithm might converge to different solutions with decorrelated errors. In this work, we ensemble a standard and a robust model, which are obtained by \emph{minimizing entirely different training losses}. This small and subtle change is crucial. Experimentally, we find that regular ensembling of two standard models does poorly OOD (11\% worse than a single robust model), and ensembling two robust models does poorly ID (3\% worse than a single standard model). For these regular ensembles, calibration has no effect: the ID and OOD accuracies stay the same.

% Why is our proposed ensembling so much better than regular ensembling? Consider the case of spurious correlations as studied in~\citep{sagawa2020overparameterization} where standard classifiers use spurious features irrespective of their random initialization. In other words, all members of the regular ensemble would have the same failure mode. The intuition of decorrelated error does not pan out with regular ensembles when standard models make systematic errors, as is common in most distribution shift settings. On the other hand, in our ensemble, we have a standard model that uses spurious features to be more accurate but also a robust model trained specifically not to use spurious features. By combining these ``diverse'' models with qualitatively different failure modes, we are able to achieve the best of both ID and OOD performance. 


%% Ensembling is a common technique, but typically each ensemble member is trained in a similar way, initialized with a different seed [CITES]---ensembling many such models leads to a slight accuracy boost.
%% For a more fair comparison, we tried ensembling two standard or two robust models---while this increases the accuracy a little, it does not mitigate the tradeoff.


% To probe further into this, we measure the \emph{relative calibration} of the ID and OOD models.
% On three out of seven of the datasets, despite being less accurate, the standard model is more confident than the robust model on OOD examples.

% % ---so if we used average confidence across the dataset to pick models, we would pick the sub-par standard model.
% Surprisingly, even on these datasets naive ensembling is able to combine these models in a per-example way to eliminate most of the tradeoff.
% % Calibration
% In summary, we show that a simple method of calibrating and then ensembling can combine the best of standard and robust models, getting high ID and OOD accuracy---and calibration plays a key role in this method. This is a general practical method to alleviate tradeoffs which can likely be combined with other innovations to completely eliminate robustness induced tradeoffs in practice. This work also raises several interesting conceptual questions regarding ensembling and uncertainty estimation.
% They seem to work despite the models being poorly 
% This leaves us with important open questions: why these methods work so well.
