\section{More information on experiments}

\subsection{Additional details on datasets}
\label{sec:more-info-datasets-appendix}

Here we describe the robustness interventions and datasets in more detail.

\textbf{Robustness interventions}:
\begin{enumerate}
	\item In-N-Out~\citep{xie2021innout}: Many datasets contain a core input $x$ (image or time series data) and metadata $z$ (e.g., location or climate data). \citet{xie2021innout} show that using the metadata (in addition to $x$) improves accuracy in-distribution (ID), but hurts accuracy out-of-distribution (OOD). They consider a standard model that takes in both the core inputs and metadata to predict the target, and a robust model that only takes in the core inputs and does some additional pretraining. We use official checkpoints from their CodaLab worksheet \url{https://worksheets.codalab.org/worksheets/0x2613c72d4f3f4fbb94e0a32c17ce5fb0}, and compare to the results tagged as ``In-N-Out'' on each dataset. They also show results after doing additional self-training on (unlabeled) OOD data, but we do not compare to this because (1) OOD data is assumed to be unavailable in our setting, and (2) if unlabeled OOD data is available, we can also start from \calens{} and do additional self-training.
% They call these the `aux-in' and `aux-out' models respectively.
	\item Lightweight fine-tuning~\citep{kumar2022finetuning}: When adapting a pretrained model to an ID dataset, typically all the model parameters are fine-tuned. Recent works show that tuning only parts of the model can often do better OOD even though the ID performance is worse~\citep{li2021prefix,houlsby2019parameter}. On four distribution shift datasets, we take checkpoints from~\citet{kumar2022finetuning} where the standard model starts from a pretrained initialization and fine-tunes all parameters on an ID dataset, and the robust model only learns the top linear `head' layer.
	\item Zero-shot language prompting: \citet{radford2021clip} pretrain a model on a large multi-modal language and vision dataset. The model can then predict the label of an image by comparing the image embedding with the language embeddings of prompts such as `photo of an apple' or `photo of a banana'. They show that this zero-shot language prompting approach (robust model) can be much more accurate OOD than the traditional method of fine-tuning the entire model (standard model), although the ID accuracy of the robust model is worse. We use model checkpoints and datasets from~\citet{radford2021clip}.
	\item Group distributionally robust optimization (DRO)~\citep{sagawa2020group}: Standard ERM models often latch on to spurious correlations in a dataset, such as image background color, or the occurrence of certain words in a sentence. Group DRO essentially upweights examples where this spurious correlation is not present.
	The original formulation in~\citet{sagawa2020group} assumes the spurious correlations are annotated, but newer variants~\citep{liu2021jtt} can work even without these annotations.
	\item CORAL~\citep{sun2016deep} aims to align feature representations across different domains, by penalizing differences in the means and covariances
	of the feature distributions.
	The hope is that this generalizes better to OOD domains.
 \end{enumerate}
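
As a rough sketch of the last two interventions (the notation here is ours for illustration, and the checkpoints we use may differ in implementation details), let $\ell(\theta; x, y)$ denote the loss of a model with parameters $\theta$ and let $P_g$ denote the distribution of examples in group $g \in \mathcal{G}$. Group DRO~\citep{sagawa2020group} minimizes the worst-group expected loss,
\[
\min_{\theta} \; \max_{g \in \mathcal{G}} \; \E_{(x, y) \sim P_g}\left[ \ell(\theta; x, y) \right],
\]
whereas ERM minimizes the average loss over the pooled data. For CORAL, one commonly used form of the penalty between two domains $a$ and $b$, with empirical feature means $\hat{\mu}_a, \hat{\mu}_b$ and covariances $\hat{\Sigma}_a, \hat{\Sigma}_b$, is
\[
\| \hat{\mu}_a - \hat{\mu}_b \|_2^2 + \| \hat{\Sigma}_a - \hat{\Sigma}_b \|_F^2,
\]
which is added to the ERM objective with a tunable weight; the exact normalization varies across implementations.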

We consider three types of \natshifts{} (geography shifts, subpopulation shifts, style shifts), and we also consider adversarial spurious shifts.

\textbf{Geography shifts.} In geography shifts the ID data comes from some locations, and the OOD data comes from a different set of locations. One motivation is that in many developing areas training data may be unavailable because of monetary constraints~\citep{jean2016combining}.
\begin{enumerate}
	\item \textbf{LandCover}~\citep{russwurm2020meta}: The goal is to classify a satellite image into one of 6 land types (e.g., ``grassland'', ``savannas''). The ID data contains images from outside Africa, and the OOD data consists of images from Africa. We take model checkpoints from~\citet{xie2021innout} where they use the In-N-Out intervention---the core feature $x$ is time series data measured by NASA's MODIS satellite, and the spurious metadata $z$ consists of climate data (e.g., temperature) at that location. We use the ID and OOD dataset splits defined by~\citet{xie2021innout}.
	\item \textbf{Cropland}~\citep{wang2020weakly}: The goal is to predict whether a satellite image shows cropland or not. The ID dataset contains images from Iowa, Missouri, and Illinois, and the OOD dataset contains images from Indiana and Kentucky. We take model checkpoints from~\citet{xie2021innout} where they use the In-N-Out intervention---the core feature $x$ is an RGB satellite image, and the spurious metadata $z$ consists of location coordinates and vegetation bands. We use the ID and OOD dataset splits defined by~\citet{xie2021innout}.
	\item \textbf{iWildCam}~\citep{beery2020iwildcam,koh2021wilds}: The goal is to classify the species of an animal given a photo taken by a camera placed in the wild (e.g., in a forest). The ID dataset consists of photos taken by over 200 cameras, and the OOD dataset consists of photos taken by held-out cameras. We use the splits by~\citet{koh2021wilds}. We take model checkpoints from~\citet{koh2021wilds}, where the standard model is trained via standard empirical risk minimization (ERM), and the robust model is trained via CORAL. The model checkpoints were taken from \url{https://worksheets.codalab.org/worksheets/0x036017edb3c74b0692831fadfe8cbf1b}.
\end{enumerate}
\ak{Optionally move the discussion of what's core and spurious to the Appendix}

\textbf{Subpopulation shifts.} In subpopulation shifts, the ID data contains a few sub-categories (e.g., black bears and sloth bears), and the OOD data contains different sub-categories (e.g., brown bears and polar bears) of the same parent category (e.g., bears). For both datasets below, we take model checkpoints from~\citet{kumar2022finetuning} where they use the lightweight fine-tuning intervention, starting from a MoCo-v2 ResNet-50 model pretrained on unlabeled ImageNet images. The datasets are from~\citet{santurkar2020breeds}.
\begin{enumerate}
	\item \textbf{Living-17}~\citep{santurkar2020breeds}: the goal is to classify an image as one of 17 animal categories such as ``bear''. The ID dataset contains images of black bears and sloth bears, and the OOD dataset has images of brown bears and polar bears.
	\item \textbf{Entity-30}~\citep{santurkar2020breeds}: similar to Living-17, except the goal is to classify an image as one of 30 entity categories such as ``food'', ``motor vehicle'', and ``index''.
\end{enumerate}
\ak{Optionally move the discussion of what checkpoints we used to the Appendix.}

\textbf{Style shifts.} In style shifts, the ID data contains data in a certain style (e.g., sketches), and the OOD data contains data in a different style (e.g., real photos, renditions). 
\begin{enumerate}
	\item \textbf{DomainNet}~\citep{peng2019moment}: a standard domain adaptation dataset. Here, our ID dataset contains ``sketch'' images (e.g., drawings of apples, elephants, etc.), and the OOD dataset contains ``real'' photos of the same categories. We take model checkpoints from~\citet{kumar2022finetuning} where they use the lightweight fine-tuning intervention, starting from a CLIP ResNet-50 model.
	\item \textbf{CelebA}~\citep{liu2015deep}: the goal is to classify a portrait of a face as ``male'' or ``female''. The ID dataset contains images of people without hats, and the OOD dataset contains images of people wearing hats (some facial features might be ``suppressed'' or ``missing'' with hats). We take model checkpoints from~\citet{xie2021innout} where they use the In-N-Out intervention---the core feature $x$ is the RGB image, and the spurious metadata $z$ consists of 7 attribute tags annotated in the dataset (e.g., presence of makeup, beard).
	\item \textbf{CIFAR$\to$STL}: a standard domain adaptation dataset~\citep{french2018selfensembling}, where the ID dataset is CIFAR-10~\citep{krizhevsky2009learningmultiple}, and the OOD dataset is STL~\citep{coates2011stl10}. The task is to classify an image into one of 10 categories such as ``dog'', ``cat'', or ``airplane''. We take model checkpoints from~\citet{kumar2022finetuning} where they use the lightweight fine-tuning intervention, starting from a MoCo-v2 ResNet-50 model pretrained on unlabeled ImageNet images.
	\item \textbf{ImageNet}~\citep{russakovsky2015imagenet}: a large scale dataset where the goal is to classify an image into one of 1000 categories. We use the zero-shot language prompting intervention using a CLIP ViT-B/16 vision transformer model. We evaluate on 3 standard OOD datasets: \textbf{ImageNetV2}~\citep{recht2019doimagenet}, \textbf{ImageNet-R}~\citep{hendrycks2020many}, and \textbf{ImageNet-Sketch}~\citep{wang2019learningrobust}. \ak{The zero-shot model checkpoint is taken from CLIP paper, and the fine-tuned model checkpoint is taken from LP-FT paper}
\end{enumerate}

\textbf{Adversarial spurious shifts.} In adversarial spurious shifts, the ID dataset contains a feature that is correlated with a label, but this correlation is flipped OOD.
% For example, waterbirds is explicitly constructed so that ``water'' backgrounds are correlated with ``waterbird'' labels in the ID, but anti-correlated OOD.
\begin{enumerate}
	\item \textbf{Waterbirds}~\citep{sagawa2020group}: The goal is to classify an image as a ``waterbird'' or ``landbird''. The dataset is synthetically constructed to have adversarially spurious features: ``water'' backgrounds are correlated with ``waterbird'' labels in the ID, but anticorrelated OOD. We use checkpoints from~\citet{jones2021selective} where they use the group DRO intervention.
	\item \textbf{MNLI}~\citep{williams2018broad}: The goal is to predict whether a hypothesis is entailed by, contradicted by, or neutral to an associated premise. We use the splits in~\citet{sagawa2020group}---they partition the dataset so that in the ID split the ``negation'' words ``nobody'', ``no'', ``never'', and ``nothing'' are correlated with the contradiction label, whereas in the OOD split these words are anticorrelated with the contradiction label. We use checkpoints from~\citet{jones2021selective} where they use the group DRO intervention.
	\item \textbf{CivilComments}~\citep{borkan2019nuanced}: The goal is to predict whether a comment is toxic or not. We use the splits in~\citet{sagawa2020group}---they partition the dataset so that in the ID split mentions of a Christian identity are correlated with non-toxic comments, but in the OOD split mentions of a Christian identity are correlated with toxic comments. We use checkpoints from~\citet{jones2021selective} where they use the group DRO intervention. CivilComments is also used in~\citet{koh2021wilds}.
\end{enumerate} 


\subsection{Per-dataset results on ensembling ablations}
\label{sec:per-dataset-ensemble-ablations}

In Section~\ref{sec:experiments-how-ensemble} we compared calibrated ensembles against ``tuned'' ensembles, where the ensemble weights are tuned on in-distribution validation data, and against vanilla ensembles.
\ak{TODO: explain what tuned ensembles are more formally (for both probs and logits). Explain prob vs logit ensembling more formally. Explain that we took the best of prob and logit for tuned ensembles}
Here, we show per-dataset results both ID (Table~\ref{tab:id_tuned}) and OOD (Table~\ref{tab:ood_tuned}).
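
One way to make these variants precise is sketched below; the notation is illustrative, and the exact parameterization is an assumption rather than a specification of our implementation. Let $z_{\mathsf{std}}(x)$ and $z_{\mathsf{rob}}(x)$ be the logit vectors of the standard and robust models, and let $\sigma$ denote the softmax function. A vanilla ensemble predicts $\arg\max_i s(x)_i$, where the score vector $s(x)$ is either
\[
s^{\mathsf{logit}}(x) = z_{\mathsf{std}}(x) + z_{\mathsf{rob}}(x)
\quad \mbox{or} \quad
s^{\mathsf{prob}}(x) = \sigma(z_{\mathsf{std}}(x)) + \sigma(z_{\mathsf{rob}}(x)).
\]
A tuned ensemble replaces the equal weights by $\alpha$ and $1 - \alpha$, where $\alpha \in [0, 1]$ is chosen to maximize accuracy on ID validation data. A calibrated ensemble instead first rescales each model's logits by a temperature fit on ID validation data, $z(x) / T$, and then combines the rescaled models with equal weights as in either vanilla variant.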

In Section~\ref{sec:experiments-how-ensemble}, we also compared calibrated ensembles (of one standard and one robust model) with ensembles of two standard models and ensembles of two robust models, where for a fair comparison all models are calibrated.
We ran this ablation on 6 of the \numtotal{} datasets (Entity-30, DomainNet, CIFAR$\to$STL, Living-17, Landcover, Cropland, and CelebA) because it requires multiple standard and multiple robust models, which were not available or very expensive to run on large datasets like ImageNet.
Calibrated ensembles get an average ID accuracy of \calaccidSeven{}\% (vs. \robrobaccidSeven{}\% for a robust-robust ensemble and \stdstdaccidSeven{}\% for a standard-standard ensemble), and an average OOD accuracy of \calaccoodSeven{}\% (vs. \robrobaccoodSeven{}\% for a robust-robust ensemble and \stdstdaccoodSeven{}\% for a standard-standard ensemble). We show per-dataset results in Table~\ref{tab:id_std_std_rob_rob} (ID) and Table~\ref{tab:ood_std_std_rob_rob} (OOD).


\begin{table*}[t]
\caption{
\emph{ID} accuracies: The in-distribution accuracies of calibrated ensembles, tuned ensembles, and vanilla ensembles are very close (within confidence intervals), so any of these methods is acceptable if we only care about in-distribution accuracy. However, they perform quite differently when it comes to OOD accuracy (Table~\ref{tab:ood_tuned}).
}
\label{tab:id_tuned}
\vskip 0.15in
\begin{center}

\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Logits & 93.7 (0.1) & 89.3 (0.6) & \textbf{97.3 (0.1)} & \textbf{97.1 (0.2)} & \textbf{77.4 (0.1)} & 95.5 (0.1) & 93.4 (0.6)\\
Probs & \textbf{93.7 (0.1)} & 89.1 (0.4) & \textbf{97.3 (0.1)} & \textbf{97.1 (0.2)} & \textbf{77.4 (0.2)} & 95.5 (0.1) & 93.4 (0.6)\\
Tuned Logits & \textbf{93.8 (0.0)} & \textbf{91.3 (0.2)} & \textbf{97.4 (0.1)} & 97.1 (0.1) & \textbf{77.3 (0.4)} & \textbf{95.6 (0.1)} & \textbf{94.8 (0.2)}\\
Tuned Probs & 93.8 (0.1) & 90.6 (0.7) & \textbf{97.4 (0.1)} & \textbf{97.2 (0.1)} & 77.1 (0.3) & 95.5 (0.1) & \textbf{95.0 (0.2)}\\
Calibrated Logits & 93.7 (0.1) & \textbf{91.1 (0.4)} & 97.2 (0.1) & \textbf{97.2 (0.2)} & 77.2 (0.2) & \textbf{95.6 (0.1)} & \textbf{94.5 (0.5)}\\
Calibrated Probs & 93.7 (0.1) & \textbf{91.2 (0.7)} & 97.2 (0.1) & \textbf{97.2 (0.2)} & 77.2 (0.2) & \textbf{95.6 (0.1)} & \textbf{94.5 (0.5)}\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccc}
\toprule
 & ImageNet & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Logits & 82.1 (-) & \textbf{84.2 (-)} & \textbf{82.9 (-)} & 90.1 (-) & 90.4 (-)\\
Probs & 82.1 (-) & 83.9 (-) & \textbf{82.9 (-)} & 90.1 (-) & 90.4 (-)\\
Tuned Logits & \textbf{82.7 (-)} & \textbf{84.1 (-)} & \textbf{83.0 (-)} & \textbf{93.2 (-)} & \textbf{92.7 (-)}\\
Tuned Probs & 82.3 (-) & 83.9 (-) & \textbf{83.0 (-)} & \textbf{93.2 (-)} & \textbf{92.6 (-)}\\
Calibrated Logits & 82.0 (-) & \textbf{84.3 (-)} & 82.8 (-) & 92.9 (-) & 91.4 (-)\\
Calibrated Probs & 82.0 (-) & 84.0 (-) & \textbf{82.8 (-)} & 92.9 (-) & 91.4 (-)\\
\bottomrule
\end{tabular}

% \begin{tabular}{ccccccc}
% \toprule
%  & Ent30 & DomNet & CIFAR10 & Land & Crop & ImNet\\
% \midrule
% Logits & 93.7 (0.1) & 89.3 (0.6) & 97.3 (0.1) & 77.4 (0.1) & 95.5 (0.1) & 80.9 (-)\\
% Probs & 93.7 (0.1) & 89.1 (0.4) & 97.3 (0.1) & 77.4 (0.2) & 95.5 (0.1) & 81.0 (-)\\
% Tuned Logits & 93.8 (0.0) & 91.3 (0.2) & 97.4 (0.1) & 77.3 (0.4) & 95.6 (0.1) & 81.7 (-)\\
% Tuned Probs & 93.8 (0.1) & 90.6 (0.7) & 97.4 (0.1) & 77.1 (0.3) & 95.5 (0.1) & 81.3 (-)\\
% Calibrated Logits & 93.7 (0.1) & 91.1 (0.4) & 97.2 (0.1) & 77.2 (0.2) & 95.6 (0.1) & 81.0 (-)\\
% Calibrated Probs & 93.7 (0.1) & 91.2 (0.7) & 97.2 (0.1) & 77.2 (0.2) & 95.6 (0.1) & 81.1 (-)\\
% \bottomrule
% \end{tabular}
\end{center}
\vskip -0.1in
\end{table*}


\begin{table*}[t]
\caption{
\emph{OOD} accuracies: calibrated ensembles outperform vanilla ensembles and even tuned ensembles where the combination weights are tuned to maximize in-distribution accuracy. Averaged across the datasets, calibrated ensembles get an OOD accuracy of \calaccood\%, while tuned ensembles get an accuracy of \tunedaccood\%. The in-distribution accuracies of the methods are very close (within 0.2\% of each other).
}
\label{tab:ood_tuned}
\vskip 0.15in
\begin{center}

\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Logits & \textbf{64.9 (0.3)} & 75.7 (1.2) & 87.3 (0.2) & \textbf{81.8 (0.4)} & \textbf{60.5 (0.8)} & \textbf{90.9 (0.2)} & \textbf{76.9 (0.9)}\\
Probs & 64.6 (0.4) & 78.7 (1.3) & 87.2 (0.2) & \textbf{81.8 (0.4)} & 59.5 (1.0) & \textbf{90.9 (0.2)} & \textbf{76.9 (0.9)}\\
Tuned Logits & \textbf{64.6 (0.6)} & 86.3 (0.6) & 85.7 (0.9) & 80.8 (0.7) & 58.7 (1.2) & \textbf{87.3 (5.7)} & \textbf{77.5 (1.3)}\\
Tuned Probs & 62.8 (0.7) & \textbf{86.9 (0.2)} & 85.0 (1.3) & 81.6 (0.5) & 58.7 (2.2) & \textbf{86.8 (5.5)} & \textbf{77.6 (1.7)}\\
Calibrated Logits & \textbf{65.0 (0.4)} & 84.4 (0.3) & \textbf{87.5 (0.2)} & \textbf{82.0 (0.4)} & \textbf{61.2 (0.8)} & \textbf{91.3 (0.8)} & \textbf{77.6 (1.2)}\\
Calibrated Probs & \textbf{64.7 (0.5)} & 86.1 (0.2) & 87.3 (0.2) & \textbf{82.2 (0.6)} & \textbf{60.8 (0.8)} & \textbf{91.3 (0.8)} & \textbf{77.6 (1.2)}\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccccc}
\toprule
 & ImNet-R & ImNet-V2 & ImNet-Sk & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Logits & 73.1 (-) & \textbf{73.7 (-)} & 52.1 (-) & \textbf{66.2 (-)} & 73.1 (-) & 66.9 (-) & \textbf{76.0 (-)}\\
Probs & 77.5 (-) & 73.4 (-) & 52.0 (-) & 65.3 (-) & 72.4 (-) & 66.9 (-) & \textbf{76.0 (-)}\\
Tuned Logits & 64.7 (-) & \textbf{73.6 (-)} & 47.9 (-) & 66.0 (-) & 68.0 (-) & \textbf{88.1 (-)} & 60.3 (-)\\
Tuned Probs & 64.0 (-) & 72.6 (-) & 45.5 (-) & 65.3 (-) & 69.4 (-) & \textbf{88.1 (-)} & 61.5 (-)\\
Calibrated Logits & 73.7 (-) & \textbf{73.6 (-)} & \textbf{52.3 (-)} & \textbf{66.1 (-)} & \textbf{73.6 (-)} & 81.1 (-) & 71.8 (-)\\
Calibrated Probs & \textbf{77.9 (-)} & 73.2 (-) & \textbf{52.3 (-)} & \textbf{66.3 (-)} & 73.2 (-) & 81.1 (-) & 71.8 (-)\\
\bottomrule
\end{tabular}

% \begin{tabular}{ccccccc}
% \toprule
%  & Ent30 & DomNet & STL & Land & Crop & ImNet-R\\
% \midrule
% Vanilla Logits & \textbf{64.9 (0.3)} & 75.7 (1.2) & \textbf{87.3 (0.2)} & \textbf{60.5 (0.8)} & \textbf{90.9 (0.2)} & 72.2 (-)\\
% Vanilla Probs & \textbf{64.6 (0.4)} & 78.7 (1.3) & 87.2 (0.2) & 59.5 (1.0) & \textbf{90.9 (0.2)} & \textbf{77.4 (-)}\\
% Tuned Logits & \textbf{64.6 (0.6)} & 86.3 (0.6) & 85.7 (0.9) & 58.7 (1.2) & \textbf{87.3 (5.7)} & 63.1 (-)\\
% Tuned Probs & 62.8 (0.7) & \textbf{86.9 (0.2)} & 85.0 (1.3) & 58.7 (2.2) & \textbf{86.8 (5.5)} & 63.8 (-)\\
% Calibrated Logits & \textbf{65.0 (0.4)} & 84.4 (0.3) & \textbf{87.5 (0.2)} & \textbf{61.2 (0.8)} & \textbf{91.3 (0.8)} & 71.7 (-)\\
% Calibrated Probs & \textbf{64.7 (0.5)} & 86.1 (0.2) & \textbf{87.3 (0.2)} & \textbf{60.8 (0.8)} & \textbf{91.3 (0.8)} & \textbf{77.1 (-)}\\
% \bottomrule
% \end{tabular}
\end{center}
\vskip -0.1in
\end{table*}


%%%%%%%%%%%%%% Std-std, rob-rob, vs. calibrated ensemble (ID) %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{ID} accuracies: Calibrated ensembles (one standard and one robust model) achieve performance comparable to or better than Standard ensembles (ensemble of two calibrated standard models) and Robust ensembles (ensemble of two calibrated robust models).
}
\label{tab:id_std_std_rob_rob}
\vskip 0.15in
\begin{center}

\begin{tabular}{ccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & CelebA\\
\midrule
Std Ensemble & \textbf{94.0 (0.0)} & 86.3 (0.4) & \textbf{97.7 (0.1)} & \textbf{97.0 (0.3)} & \textbf{77.9 (0.1)} & 91.7 (0.4)\\
Rob Ensemble & 90.9 (0.2) & 89.3 (0.3) & 92.0 (0.0) & \textbf{97.1 (0.1)} & 73.4 (0.2) & \textbf{95.2 (0.4)}\\
Cal ensemble & 93.7 (0.1) & \textbf{91.2 (0.7)} & 97.2 (0.1) & \textbf{97.2 (0.2)} & 77.2 (0.2) & 94.5 (0.5)\\
\bottomrule
\end{tabular}

\end{center}
\vskip -0.1in
\end{table*}


%%%%%%%%%%%%%% Std-std, rob-rob, vs. calibrated ensemble (OOD) %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{OOD} accuracies: Calibrated ensembles (one standard and one robust model) achieve performance comparable to or better than Standard ensembles (ensemble of two calibrated standard models) and Robust ensembles (ensemble of two calibrated robust models).
}
\label{tab:ood_std_std_rob_rob}
\vskip 0.15in
\begin{center}

\begin{tabular}{ccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & CelebA\\
\midrule
Std Ensemble & 61.7 (0.2) & 57.9 (0.2) & 83.5 (0.2) & 78.6 (0.4) & 57.5 (0.7) & 73.7 (1.1)\\
Rob Ensemble & 63.8 (0.4) & \textbf{87.5 (0.1)} & 85.1 (0.1) & \textbf{82.4 (0.1)} & \textbf{60.5 (1.4)} & \textbf{78.0 (0.6)}\\
Cal ensemble & \textbf{64.7 (0.5)} & 86.1 (0.2) & \textbf{87.3 (0.2)} & \textbf{82.2 (0.6)} & \textbf{60.8 (0.8)} & \textbf{77.6 (1.2)}\\
\bottomrule
\end{tabular}

\end{center}
\vskip -0.1in
\end{table*}


\subsection{Per-dataset results on calibration and confidence}
\label{sec:per-dataset-calibration-appendix}

\textbf{Relative confidence can be incorrect.}
We measure the confidence of a model $f$ on a distribution $P$ as $\mbox{conf}(f, P) = \E_{x \sim P}[\max_i f(x)_i]$.
Even if the models are not calibrated OOD, one intuitive explanation for why calibrated ensembles work is that the robust model has higher confidence OOD, so that the ensemble primarily uses the (more accurate) robust model's predictions OOD.
\ak{maybe simplify. even if the robust model is more confident on average that doesn't mean we get best of both worlds. so maybe just directly say: on OOD data, the robust model is typically more confident than the standard model, which is reasonable since the robust model is also more accurate.}
However, on the remote sensing dataset Landcover we find that the robust model is 6\% \emph{less confident} on OOD data than the standard model even though the robust model is 5\% \emph{more accurate} OOD than the standard model.
Interestingly, calibrated ensembles are able to combine the models in a more fine-grained way to get the best of both worlds, which is captured in our stylized setting in Section~\ref{sec:analysis}.
We show the average confidence of the standard and robust models for each dataset ID (Table~\ref{tab:id_conf}) and OOD (Table~\ref{tab:ood_conf}).

\textbf{Per-dataset results for ECE.}
In Section~\ref{sec:experiments_models_miscalibrated}, we discussed the ECE of the standard and robust models \emph{after calibrating on ID data}.
Here we show the results for each dataset ID (Table~\ref{tab:id_ece}) and OOD (Table~\ref{tab:ood_ece}).
We also show the ECE of the standard and robust models \emph{before calibrating on ID data}, on ID (Table~\ref{tab:id_ece_precalibration}) and on OOD (Table~\ref{tab:ood_ece_precalibration}).
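
For reference, a standard binned definition of ECE (the binning scheme is not specified in this section, so the form below is an assumption) partitions the $n$ test examples into bins $S_1, \ldots, S_B$ by their confidence $\max_i f(x)_i$ and computes
\[
\mbox{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{n} \left| \mbox{acc}(S_b) - \mbox{conf}(S_b) \right|,
\]
where $\mbox{acc}(S_b)$ is the fraction of examples in $S_b$ that are classified correctly and $\mbox{conf}(S_b)$ is their average confidence. Temperature scaling, in its usual form, divides the logits $z(x)$ by a single scalar $T > 0$ chosen to minimize the negative log-likelihood of $\mbox{softmax}(z(x) / T)$ on held-out ID validation data; it changes the confidences but not the predicted labels.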


%%%%%%%%%%%%%% ECE, post-calibration (ID) %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{ID} ECE: The expected calibration error (ECE) of the standard and robust models on ID test data, after calibrating on ID validation data.
The ID calibration errors are low---note that we only use 500 examples to temperature scale, so for ImageNet we have fewer examples than classes for post-calibration, but the models are still fairly well calibrated.
}
\label{tab:id_ece}
\vskip 0.15in
\begin{center}
\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Cal. Standard & 0.7 (0.1) & 2.0 (0.3) & 0.8 (0.2) & 1.3 (0.2) & 1.1 (0.5) & 1.4 (0.3) & 2.7 (0.4)\\
Cal. Robust & 1.1 (0.4) & 2.2 (0.2) & 1.3 (0.2) & 1.8 (0.0) & 1.7 (0.3) & 3.5 (0.2) & 1.2 (0.3)\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccc}
\toprule
 & ImageNet & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Cal. Standard & 1.2 (-) & 3.6 (-) & 2.2 (-) & 1.2 (-) & 1.2 (-)\\
Cal. Robust & 2.3 (-) & 1.3 (-) & 2.5 (-) & 0.5 (-) & 8.1 (-)\\
\bottomrule
\end{tabular}
\end{center}
\vskip -0.1in
\end{table*}


%%%%%%%%%%%%%% ECE, post-calibration (OOD) %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{OOD} ECE: The expected calibration error (ECE) of the standard and robust models on OOD test data, after calibrating on ID validation data.
The calibration errors here are high, especially compared to the ID calibration errors in Table~\ref{tab:id_ece}.
}
\label{tab:ood_ece}
\vskip 0.15in
\begin{center}
\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Cal. Standard & 15.4 (0.8) & 13.6 (1.5) & 5.6 (1.1) & 11.4 (0.3) & 16.4 (0.8) & 7.4 (4.8) & 11.5 (1.0)\\
Cal. Robust & 14.3 (1.5) & 5.5 (0.5) & 8.2 (0.0) & 8.7 (0.2) & 6.5 (1.1) & 5.0 (0.3) & 14.0 (1.4)\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccccc}
\toprule
 & ImNet-R & ImNet-V2 & ImNet-Sk & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Cal. Standard & 5.4 (-) & 4.0 (-) & 10.1 (-) & 3.2 (-) & 13.2 (-) & 17.7 (-) & 23.3 (-)\\
Cal. Robust & 4.0 (-) & 4.9 (-) & 5.1 (-) & 2.4 (-) & 4.2 (-) & 5.5 (-) & 6.3 (-)\\
\bottomrule
\end{tabular}

\end{center}
\vskip -0.1in
\end{table*}




%%%%%%%%%%%%%%% These should certainly go into the Appendix. %%%%%%%%%%%%%


%%%%%%%%%%%%%% Confidence on ID data (after calibration) %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{ID} Confidences: The confidence of the standard and robust models on ID test data (after calibrating on ID data).
The standard model is typically more confident than the robust model, which is reasonable since the standard model is also typically more accurate.
There are a few exceptions such as DomainNet, CelebA, and Waterbirds where the standard model is less confident than the robust model, but the standard model is also less accurate in these cases, so this is also reasonable.
}
\label{tab:id_conf}
\vskip 0.15in
\begin{center}
\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Cal. Standard & 93.1 (0.3) & 83.7 (0.4) & 96.9 (0.6) & 97.0 (0.2) & 76.5 (0.9) & 95.5 (0.4) & 91.7 (0.6)\\
Cal. Robust & 89.9 (0.4) & 89.6 (0.1) & 91.0 (0.1) & 96.0 (0.1) & 71.3 (0.5) & 94.9 (0.5) & 94.7 (0.2)\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccc}
\toprule
 & ImageNet & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Cal. Standard & 82.1 (-) & 82.1 (-) & 82.6 (-) & 87.9 (-) & 93.6 (-)\\
Cal. Robust & 68.1 (-) & 82.3 (-) & 81.9 (-) & 93.2 (-) & 87.0 (-)\\
\bottomrule
\end{tabular}
\end{center}
\vskip -0.1in
\end{table*}




%%%%%%%%%%%%%% Confidence on OOD data (after calibration) %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{OOD} Confidences. The confidence of the standard and robust models on OOD test data (after calibrating on ID data).
The robust model is usually more confident than the standard model, which is reasonable since the robust model is also typically more accurate.
However, Landcover is a notable exception: the robust model is less confident OOD, even though it is more accurate (see Table~\ref{tab:ood_results}).
}
\label{tab:ood_conf}
\vskip 0.15in
\begin{center}
\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Cal. Standard & 76.1 (0.8) & 68.9 (1.5) & 87.8 (1.2) & 89.2 (0.5) & 72.0 (1.9) & 92.8 (1.0) & 85.5 (1.5)\\
Cal. Robust & 77.5 (0.4) & 92.6 (0.4) & 93.3 (0.1) & 90.8 (0.2) & 66.0 (0.6) & 94.1 (0.4) & 90.1 (0.1)\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccccc}
\toprule
 & ImNet-R & ImNet-V2 & ImNet-Sk & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Cal. Standard & 57.8 (-) & 75.5 (-) & 50.6 (-) & 59.1 (-) & 77.0 (-) & 78.1 (-) & 80.1 (-)\\
Cal. Robust & 74.0 (-) & 64.2 (-) & 53.2 (-) & 65.1 (-) & 79.7 (-) & 92.5 (-) & 80.4 (-)\\
\bottomrule
\end{tabular}
\end{center}
\vskip -0.1in
\end{table*}



%%%%%%%%%%%%%% ECE, before calibration (ID) %%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%% NOTE: this is before calibration %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{ID} ECE. The expected calibration error (ECE) of the standard and robust models on ID test data, \emph{before calibration} (the key difference from Table~\ref{tab:id_ece} is that this is before calibration).
We can see that calibrating on ID data substantially reduces the ECE on ID data (see Table~\ref{tab:id_ece}).
}
\label{tab:id_ece_precalibration}
\vskip 0.15in
\begin{center}
\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Standard & 1.0 (0.1) & 8.5 (0.7) & 1.2 (0.1) & 1.2 (0.1) & 6.7 (1.2) & 1.5 (0.3) & 5.9 (0.5)\\
Robust & 1.1 (0.3) & 5.8 (1.3) & 1.1 (0.2) & 3.4 (0.4) & 1.3 (0.1) & 3.5 (0.1) & 1.8 (0.2)\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccc}
\toprule
 & ImageNet & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Standard & 2.2 (-) & 10.9 (-) & 9.0 (-) & 8.2 (-) & 3.7 (-)\\
Robust & 2.4 (-) & 2.8 (-) & 8.2 (-) & 14.8 (-) & 10.2 (-)\\
\bottomrule
\end{tabular}
\end{center}
\vskip -0.1in
\end{table*}


%%%%%%%%%%%%%% ECE, before calibration (OOD) %%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%% NOTE: this is before calibration %%%%%%%%%%%%%%%%%%%
\begin{table*}[t]
\caption{
\emph{OOD} ECE: The expected calibration error (ECE) of the standard and robust models on OOD test data, \emph{before calibration} (the key difference from Table~\ref{tab:ood_ece} is that this is before calibration).
The calibration errors here are higher than the ID calibration errors in Table~\ref{tab:id_ece_precalibration}.
Comparing with Table~\ref{tab:ood_ece} (which is after calibration on ID data), we see that calibrating on ID data does help OOD calibration a little, although the models still remain miscalibrated OOD.
}
\label{tab:ood_ece_precalibration}
\vskip 0.15in
\begin{center}
\begin{tabular}{cccccccc}
\toprule
 & Ent30 & DomNet & CIFAR10 & Liv17 & Land & Crop & CelebA\\
\midrule
Standard & 19.1 (0.3) & 29.5 (0.5) & 10.1 (0.3) & 11.7 (0.4) & 24.7 (1.5) & 8.3 (4.3) & 17.6 (0.5)\\
Robust & 14.3 (1.6) & 1.8 (0.8) & 8.4 (0.3) & 6.8 (0.2) & 7.1 (1.3) & 8.4 (0.7) & 12.7 (0.7)\\
\bottomrule
\end{tabular}
\vspace{1.2mm}
\newline
\begin{tabular}{cccccccc}
\toprule
 & ImNet-R & ImNet-V2 & ImNet-Sk & iWildCam & MNLI & Waterbirds & Comments\\
\midrule
Standard & 7.9 (-) & 6.1 (-) & 13.3 (-) & 19.5 (-) & 22.7 (-) & 31.8 (-) & 30.0 (-)\\
Robust & 3.9 (-) & 5.2 (-) & 5.2 (-) & 5.3 (-) & 10.3 (-) & 10.4 (-) & 9.9 (-)\\
\bottomrule
\end{tabular}
\end{center}
\vskip -0.1in
\end{table*}
