\section{Intuitions and analysis}
\label{sec:analysis}

In this section, we build basic intuitions for when and why \calens{} can get the best of both worlds (good ID accuracy of $\fstd$ and OOD accuracy of $\frob$), even without using any OOD data.
We first define a stylized setting, and then analyze the ID performance in Section~\ref{sec:analysis_id} and OOD performance in Section~\ref{sec:analysis_ood}.
% The key selling point is the strong empirical performance in Section~\ref{sec:experiments}.
% Our goal is to get the best of both worlds (good ID accuracy of $\fstd$ and OOD accuracy of $\frob$) using only ID validation data.
% Here, we consider a stylized setting to build up to a principled approach.
% We analyze the ID accuracy in Section~\ref{sec:analysis_id} and OOD accuracy in Section~\ref{sec:analysis_ood}.
% We describe our final method in Section~\ref{sec:methods}.
% Standard models are trained with empirical risk minimization on training data, while robust models are trained with a robustness intervention such as projecting out spurious features [cites], linear probing a pretrained model [cites], zero-shot language prompting [cites], and distributionally robust optimization [cites].
% and evaluate on real datasets in Section~\ref{sec:experiments}.

% we assume $\fstd$ relies on spurious features of the input (that) 
% Recall that the goal of this work is to combine the strengths of a standard model $\fstd$ (with lower ID error) and a robust $\frob$ (with lower OOD error) in order to achieve the best of both worlds---low ID and low OOD error. To do so, we need to understand how $\fstd$ and $\frob$ differ.
% To get the best of $\fstd$ and $\frob$, we need to understand the relationship between $\fstd$ and $\frob$.
\textbf{Diverse features.} An intuitive and illustrative conceptual setting is the following: we assume inputs have some robust features (which are predictive both ID and OOD) and some spurious features (which are predictive only ID). The standard model $\fstd$ relies on the spurious features while the robust model $\frob$ relies on the robust features, and the two sets of features provide independent signals about the label.

\begin{assumption}
  We assume that $\frob$ and $\fstd$ rely on diverse features with respect to $\Pid$ and $\Pood$. Formally, the outputs of the two models are conditionally independent given the label:
\begin{equation}
\frob(x) \perp \fstd(x) \mid y \quad \mbox{when $(x, y) \sim P$ for $P \in \{\Pid, \Pood\}$.}
\end{equation}
\end{assumption}

\textbf{Connection with prior assumptions.}
The diverse features assumption is weaker than the assumptions in prior conceptual models of distribution shifts~\cite{chen2020selftraining,sagawa2020overparameterization,nagarajan2020understanding} where robust and spurious features are disjoint parts of the input, each generated independently based on the label. In our setting, the features can be complicated functions of the inputs.
% (that are conditionally independent given the label).

\textbf{Ensemble.} The ensemble $\fens$ simply adds the outputs (logits) of the standard model $\fstd$ and the robust model $\frob$. This form is slightly different from the method in Section~\ref{sec:methods}, but is more amenable to analysis:
\begin{equation}
    \label{eqn:ensemble_dfn}
    \fens(x) = \fstd(x) + \frob(x)
\end{equation}

% \ar{motivate why label distribution is important...this is not something people usually think about and is also not defined in the setup?}
% \ak{Saw this comment, trying to think about how to movivate it in a succinct way}
% \ak{update: discussed on slack, and edited accordingly}
\textbf{Class-balanced.} For simplicity of exposition, we assume the class-balanced setting, where every label is equally likely. Formally, we say a distribution $P$ over $K$ classes is class-balanced if $P(Y=y) = 1/K$ for all $y \in [K]$. We analyze the general setting in Appendix~\ref{app:analysis_appendix}.
% which requires a slight modification to the ensemble formula in Equation~\ref{eqn:ensemble_dfn}.

% \begin{definition}
% We say $P$ is class-balanced if $P(Y=y) = 1/K$ for all $y \in [K]$.
% \end{definition}


% We have a standard model $\fstd$ trained via empirical risk minimization on a training dataset, and a robust model $\frob$ trained with robustness interventions to avoid using aspects of the data specific to in-distribution (ID) such as location information, image background, or domain specific features.
% $\fstd$ leverages ID-specific features so it performs better ID ($\Errid(\fstd) \leq \Errood(\frob)$) but these features don't generalize 

% Standard and robust models typically use different features of the input data---for example,~\citet{xie2021innout, sagawa2020group}[fine-tuning] train a robust model to avoid using spurious aspects of the data such as location information, image background, or domain specific features.
% On the other hand, models trained with empirical risk minimization often rely primarily on these simple spurious features when available and ignore more complicated features of the input [cites].=




%% assumes the input can be split into two sets of features that provide conditionally independent signals of the label: $x = (x_1, x_2)$ with $x_1 \perp x_2 \mid Y$, where the standard model primarily uses $x_1$ and the robust model uses $x_2$.
%% In this case we have $\fstd(x_1) \perp \frob(x_2) \mid Y$, so our assumption holds.

% For example, standard ERM models often rely primarily on simple spurious features when available and ignore more complicated features of the input, whereas robust models (cite) are trained to project out spurious features

% \ar{I somehow don't like ``optimal ensembling'' and would rather use ``ensemble that achieves the best ID performance''. Optimal could mean many things...especially when ensembling is itself not formally defined}
% \ar{Also, we haven't formally defined what an ensembling is, so maybe we should do that atleast informally in the beginning of this section. Something like we want a general method that doesn't tailor to the particular robust training strategy or internals of the model. So we assume only access to the model confidences of the standard and robust model and aim to get the best of both worlds by leveraging the individual confidences} 
% \ak{True, changed to a more neutral ``ID performance of ensembles'', and the next section is ``OOD performance of ensembles''}
% \ak{Defined ensembles now in the previous section! Does that look good?}

\subsection{ID performance of ensembles}
\label{sec:analysis_id}

In this section, we show that if $\fstd$ and $\frob$ are \emph{calibrated} with respect to $\Pid$, then the ensemble $\fens$ is the best way to combine their predictions.
Since we have access to validation data from $\Pid$, the first step of our method (Section~\ref{sec:methods}) is to calibrate $\fstd$ and $\frob$ ID.
We conclude the section by giving intuition for why this calibration step can be particularly important for deep neural networks.
% We first examine the ID performance of the ensemble $\fens$, and in Section~\ref{sec:analysis_ood} we examine the OOD performance.
% We show that if $\fstd$ and $\frob$ are \emph{calibrated} with respect to $\Pid$, then the best way to combine them is to ensemble them: to add up their pedictions and output the label with highest confidence.

Intuitively, calibration means that the probability a model outputs for an event reflects the true frequency of that event: if a model predicts that each of 1,000 patients has the flu with probability 0.1, then approximately 100 of them should indeed have the flu.
Formally, we look at joint calibration~\citep{murphy1973vector, brocker2009decomposition} where a model $f$ is calibrated with respect to a distribution $P$ if for all $x \in \cX, y \in [K]$:
\begin{equation}
P(Y = y \mid f(X) = f(x)) = \softmax(f(x))_y.
\end{equation}
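As a concrete illustration of this definition (the numbers are ours and purely illustrative), take $K = 2$ with classes (flu, no flu) and suppose the model outputs the logits $f(x) = (0, \log 9)$ on some input $x$. Then
\[
\softmax(f(x)) = \left( \frac{e^{0}}{e^{0} + e^{\log 9}}, \; \frac{e^{\log 9}}{e^{0} + e^{\log 9}} \right) = (0.1, \; 0.9),
\]
so calibration requires that, among all inputs on which $f$ outputs these logits, a $0.1$ fraction truly have the flu.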

The following proposition says that if $\fstd$ and $\frob$ are calibrated on $\Pid$, then the error of $\fens$ on $\Pid$ is no higher than that of any other way of combining the two models. In particular, $\fens$ is at least as accurate as $\fstd$ and $\frob$ individually.
\newcommand{\calibrationEnsembleOptimalText}{
Suppose that $\fstd$ and $\frob$ are calibrated with respect to $\Pid$, and that $\Pid$ is class-balanced.
% Let $\fens(x) = \frob(x) + \fstd(x)$.
Let $h : \R^K \times \R^K \to \R^K$ be an arbitrary function that combines the predictions of the standard and robust models, and let $f_h$ be the resulting classifier: $f_h(x) = h(\fstd(x), \frob(x))$.
The ensemble is at least as accurate as any such combination classifier $f_h$: $\Errid(\fens) \leq \Errid(f_h)$.
}
\begin{proposition}
\label{prop:calibration-ensemble-optimal}
\calibrationEnsembleOptimalText{}
\end{proposition}

The proof of Proposition~\ref{prop:calibration-ensemble-optimal} is in Appendix~\ref{app:analysis_appendix}. Intuitively, since $\frob(x) \perp \fstd(x) \mid y$ and the classes are balanced, the Bayes-optimal posterior over labels given both models' outputs is proportional to the product of their predicted probabilities, which corresponds to adding their logits (logits live in log space).
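As a rough sketch of this calculation (not a substitute for the full proof in the appendix), write $\Pid(y \mid f(x))$ as shorthand for $\Pid(Y = y \mid f(X) = f(x))$. Conditional independence and Bayes' rule give
\[
\Pid(y \mid \fstd(x), \frob(x)) \;\propto\; \Pid(\fstd(x) \mid y) \, \Pid(\frob(x) \mid y) \, \Pid(y) \;\propto\; \frac{\Pid(y \mid \fstd(x)) \, \Pid(y \mid \frob(x))}{\Pid(y)},
\]
where $\propto$ hides factors that do not depend on $y$. With class balance, $\Pid(y) = 1/K$ is constant, and calibration lets us substitute the softmax outputs:
\[
\Pid(y \mid \fstd(x), \frob(x)) \;\propto\; \softmax(\fstd(x))_y \, \softmax(\frob(x))_y \;\propto\; \exp\big(\fstd(x)_y + \frob(x)_y\big) = \exp\big(\fens(x)_y\big).
\]
The label maximizing this posterior is therefore the label with the largest ensemble logit $\fens(x)_y$.
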
Proposition~\ref{prop:calibration-ensemble-optimal} has an important condition: the two models must be calibrated.
In practice, deep learning models are often miscalibrated~\citep{guo2017calibration}, so the first step of our method (Section~\ref{sec:methods}) is to calibrate the models ID.
We now explain why this ID calibration step is important for deep neural networks.

\textbf{Why neural networks are miscalibrated.}
Deep neural networks are typically large enough to memorize the training dataset, and are encouraged to magnify their weights (and hence their confidence) to decrease the training loss~\citep{mukhoti2020calibrating,bai2021dont}.
The extent of this miscalibration and overconfidence depends on the training procedure~\citep{hendrycks2019pretraining,desai2020calibration}.
Since $\fstd$ and $\frob$ are trained in different ways, they are miscalibrated to different extents (Appendix~\ref{sec:per-dataset-calibration-appendix}).

\textbf{Why this miscalibration can hurt ensembling.}
% If we directly ensemble these miscalibrated models, ensembles may not get the best of both worlds.
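Concretely, consider two models $\fstd'$ and $\frob'$ that are calibrated on $\Pid$.
Let $\fstd(x) = M \fstd'(x)$ for a large constant $M > 1$ (this magnifies its weights as discussed above), and let $\frob = \frob'$.
Then $\fstd$ and $\fstd'$ make the same predictions, and therefore have the same accuracy, but $\fstd$ is highly miscalibrated.
The ensemble is given by $\fens(x) = M \fstd'(x) + \frob'(x)$.
For very large $M$, $\fens$ makes the same predictions as $\fstd$, so $\Errood(\fens) = \Errood(\fstd)$. Whenever the standard model is less accurate OOD than the robust model, this gives $\Errood(\fens) > \Errood(\frob)$, and so ensembling does not get the best of both worlds.
Note that the problem is the \emph{mismatch} in scale between the two models: if both were scaled by the same factor $M$, the ensemble $M (\fstd'(x) + \frob'(x))$ would make the same predictions as the ensemble of the calibrated models.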
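
As a small numerical illustration (the logits below are hypothetical and chosen only to make the arithmetic easy), take $K = 2$ and an input $x$ with $\fstd'(x) = (1, 0)$ and $\frob'(x) = (0, 2)$, so the two calibrated models disagree and the robust model is the more confident one. The ensemble of the calibrated models sides with the robust model, whereas with $M = 10$ it sides with the standard model:
\[
\fstd'(x) + \frob'(x) = (1, 2), \qquad M \fstd'(x) + \frob'(x) = (10, 2).
\]
The first sum predicts class 2 and the second predicts class 1, even though neither model's own prediction has changed.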

% \ar{There are two parts here: (i) one is that ensembling miscalibrated models is bad and (ii) overparameterized models are miscalibrated. We should state them both clearly one after the other and not go back-forth. I prefer the reversed order of what you have and maybe replacing the paragraph titles with something more formal/precise. Avoid terms like ``this''}
% \ak{I've changed the section substantially based on your suggestions! Also made the titles more precise. I didn't change the order because: the logical train of thought would then be (i) ensembling miscalibrated models can be bad, (ii) neural nets are miscalibrated, which does not establish the conclusion that ensembling neural nets can be bad, because maybe its the type of miscalbrated model that's not bad (we didn't say all miscalibrated models are bad). Specically, we show that neural networks are miscalibrated in a certain way, and that certain way is bad for ensembling}
%\ar{It is not clear to me why robust models are typically less confident than standard...I'd avoid that and just say empirically we observe robust models are less confident than standard models to keep it simple}
% \ak{removed, good point}
%  \ar{I think this part should also connect with the proposition better. Maybe start with the proposition tells us that calibration is sufficient for ensembling to be optimal. But is it necessary? We study that in this section. We make two observations (i) ensembling can be highly suboptimal if the models are miscalibrated (ii) overparameterized models such as deep networks are typically miscalibrated}
% \ak{Redid this based on your suggestions}


% As a corollary, the ensemble is better than the standard model and the robust model: $\Errid(\fens) \leq \Errid(\fstd)$ and $\Errid(\fens) \leq \Errid(\frob)$
% If the standard and robust model disagree sometimes (and the standard model is more confident sometimes, and the robust model at other times), the ensemble is strictly better than either model: $\Errid(\fens) < \Errid(\frob)$ and $\Errid(\fens) < \Errid(\fstd)$.

% \textbf{Importance of calibration.}
% Proposition~\ref{} shows calibration is \textit{sufficient} for getting the best of both worlds.
% - We now explain if the models are not calibrated, then adding their predictions may not get the best of both worlds.
% - Suppose \fstd and \frob are calibrated
% - Scaling up the classifier, \fstd, M \fstd does not change the predictions and has the same accuracy, however is highly overconfident / miscalibrated
% - Then adding up the predictions: f = M \fstd + \frob, for very large M, will make the same predictions as the standard model
% - So then \Err(f) = \Err(\fstd) < \Errood(\frob)

% Before we dive into OOD performance, we first discuss some practical implications from Proposition~\ref{prop:calibration-ensemble-optimal}.

% \paragraph{Overparameterization and the need for calibration.}

% Proposition~\ref{prop:calibration-ensemble-optimal} seems to suggest the most natural ensembling strategy with one key caveat---it requires that the two models are calibrated.
% In practice deep learning models are miscalibrated~\citep{guo2017calibration}, and typically highly overconfident in their predictions. In this case, it might not no longer be optimal to simply add the model confidences, and hence the first step in our ensembling approach is to calibrate the models ID.
% We explain why this calibration can be important when ensembling standard and robust models for the best ID performance.
% \ar{I didn't go over the rest of this section too carefully}

% \textbf{Why calibration is important.}



% Deep networks are large enough to memorize the entire training dataset, and are encouraged to magnify their weights (and hence their confidence) so as to minimize the training loss CITES
% For example, if $\bar{f}$ gets 100\% training accuracy, then multiplying its weights $f(x) = M \bar{f}(x)$ where M > 1, further decreases the cross-entropy loss.
% This leads to miscalibration---the extent of miscalibration depends on the training procedure, architecture, and size CITES
% In our case the standard and robust models are trained in completely different ways and have very different 
% Deep networks are highly overparameterized and can often fit every example to get 100\% train accuracy.
% We typically fit deep learning classifiers by optimizing a loss (such as the cross-entropy loss) on some training data.
% \ar{Unnecessary to write this training loss}
% Let the training loss over $n$ examples be given by:
% \begin{equation}
% \hat{L}(f) = \sum_{i=1}^n l(f(x_i), y_i),
% \end{equation}
% where $l$ is the cross-entropy loss $l(s, p) = \sum_i p_i \log(\softmax(s)_i)$.
% Suppose that $f$ gets 100\% train accuracy so that $\argmax_k f(x_i)_k = y_i$ for all training examples $i = 1, \ldots, n$.
% Then, multiplying the model outputs by a large constant drives the loss to 0: $\hat{L}(c f) < \hat{L}(f)$ if $c > 1$, and $\hat{L}(c f) \to 0$ as $c \to \infty$, where $c f : \cX \to \R^K$ denotes scaling the outputs of $f$ by $c$ so that $(cf)(x) = c(f(x))$.
% This means optimizing the loss drives $c$ to infinity to reduce the loss: the outputs of the model get sharpened and the classifier becomes highly overconfident.
% The extent of this overconfidence depends on details such as how long each model was trained on (the more steps of gradient descent we take the larger $c$ becomes and hence the more overconfident the model), the model architecture, degree of overparameterization, etc~\citep{guo2017calibration,minderer2021revisiting}.
% \ar{Is there a reference that you can cite here?}
% \ak{Done!}

% \textbf{How this affects model combination}:
% \ar{What's this?} This means the ensemble of the standard and robust models can be written as $\fens(x) = \cstd {\fstdbar}(x) + \crob {\frobbar}(x)$, where ${\fstdbar}$ and ${\frobbar}$ are normalized versions of the classifier. \ar{we haven't defined what a normalized version is, and where exactly this equation came from. You might also want to start with more context. State what the classifiers are and what the assumptions are (miscalibrated). I also wonder if this can be made some kind of proposition}
% If $\cstd$ is much larger than $\crob$, then the ensemble predictions are the same as ${\fstdbar}$, but if $\crob$ is very large, then the ensemble predictions are the same as ${\frobbar}$.
% So there's no clear guarantee on how these classifiers will be weighted---in practice, robust models are typically trained via some form of regularization or projecting out inputs so $\crob < \cstd$ and the ensemble uses the standard classifier more than the robust classifier. \ar{It is not clear to me why robust models are typically less confident than standard...I'd avoid that and just say empirically we observe robust models are less confident than standard models to keep it simple}
% Calibrating each classifier ensures that the ensemble is able to leverage both predictors properly. \ar{I think this part should also connect with the proposition better. Maybe start with the proposition tells us that calibration is sufficient for ensembling to be optimal. But is it necessary? We study that in this section. We make two observations (i) ensembling can be highly suboptimal if the models are miscalibrated (ii) overparameterized models such as deep networks are typically miscalibrated}

\subsection{OOD performance of ensembles}
\label{sec:analysis_ood}

Proposition~\ref{prop:calibration-ensemble-optimal} shows that if $\fstd$ and $\frob$ are calibrated on $\Pid$, then $\fens$ is at least as accurate as both models on $\Pid$; the same argument applies to any class-balanced distribution $P$ on which the models are calibrated and have diverse features.
However, our validation data is from $\Pid$, so we can only calibrate $\fstd$ and $\frob$ ID.
Even after this ID calibration step, $\fstd$ and $\frob$ are often very miscalibrated OOD, that is, on $\Pood$ (see Appendix~\ref{sec:per-dataset-calibration-appendix} and~\citet{ovadia2019uncertainty}).

Our goal in this section is to build basic intuitions for when ID-calibrated ensembles can get high OOD accuracy.
% In general, analyzing how deep networks perform OOD is very challenging---
We draw inspiration from distribution shift benchmarks but examine simplified and stylized shifts.
A toy version of our analysis is visualized in Figure~\ref{fig:analysis_intuitions}, where the standard model relies on spurious features that change out-of-distribution.
If these features are ``suppressed'' or ``missing'' OOD, then $\fens$ does better than $\fstd$ and $\frob$ (Figure~\ref{fig:sup_spur}).
However, if these features are anticorrelated OOD (correlated with the opposite label), then the accuracy of $\fens$ lies between the accuracies of $\fstd$ and $\frob$ (Figure~\ref{fig:adv_spur}).
We begin by formalizing these shifts, and then analyze the accuracy under these shifts.

% \ar{Set the stage better. So far, we have seen that for the best ID accuracy, we need to ensemble models calibrated ID. The proposition can be generalized to any distribution P suggesting that that for the best OOD accuracy, we need to ensemble models calibrated OOD. Howewver..}
% \ak{Hopefully this is better!}

% Calibration with respect to a distribution requires validation data from the distribution. 
% We do not have access to OOD data, so we \emph{only calibrate models using ID data}.
% Unfortunately, even after calibrating ID, models can still be highly miscalibrated OOD because of the distribution shift (cites), and even their \emph{relative calibration} can be incorrect: the standard model can be less accurate but on average more confident on OOD data (see Section X). \ar{This section also makes me wonder if ``calibrated ensembles'' should be replaced by ``ID-calibrated ensembles''. I think we should make that change and apply everywhere for consistency. Don't use ensembles without qualification unless you're refering to the general class of all possible ensembles}

% Interestingly, we show that \ar{careful - missing qualifier. It should be ``ID-calibrated''}ensembles can still get the best of both models OOD under a variety of shifts in stylized settings. We then draw general lessons from these stylized settings to guide when ensembling ID calibrated models achieve good OOD performance in real world datasets. 
% \ar{consistent wrt world or models}
% \ak{best of both worlds means good ID and OOD. Here we show we get best of both models OOD.}
%% We model a variety of shifts in stylized settings, and interestingly see that ensembles can still get the best of both the standard and robust model when ID-specific features are simply missing OOD (e.g., in style shift, geographical shift, or subpopulation shift settings).
%% On the other hand, the ensembles accuracy is in between the two constituent models' accuracies if the spurious feature is anti-correlated OOD, or if there is a large label shift.
\begin{figure*}
    
     \begin{center}
     \hfill
	 \begin{subfigure}[b]{0.25\textwidth}
	     \centering
	     \includegraphics[width=\textwidth]{figures/new_theory_part_1.png}
	     \caption{In-distribution}
	     \label{fig:id_no_spur}
	 \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/new_theory_part_2.jpg}
         \caption{Missing spurious}
         \label{fig:sup_spur}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/new_theory_part_3.png}
         \caption{Anticorrelated spurious}
         \label{fig:adv_spur}
     \end{subfigure}
     \hfill
     % \hfill
     \caption{
     A toy version of our analysis in Section~\ref{sec:analysis}. (Figure~\ref{fig:id_no_spur}) Given a standard model $\fstd$ (red horizontal line) and a robust model $\frob$ (blue vertical line) that use different aspects of the data, ensembling their predictions gives a predictor $\fens$ (black dotted line) with lower error. In this case $\fens$ completely separates the positive (green circle) and negative (yellow circle) examples in-distribution (ID). (Figure~\ref{fig:sup_spur}) $\fstd$ uses spurious features. Suppose that these features are missing OOD (e.g., the vertical ($y$-axis) component of the input goes close to $0$). Then $\fstd$ fares poorly and mislabels half the inputs, but the ensemble $\fens$ is about as accurate as the robust model $\frob$. (Figure~\ref{fig:adv_spur}) On the other hand, suppose the spurious features are \emph{anticorrelated} with the label OOD. In this case $\fens$ intersects the positive (green circle) and negative (yellow circle) distributions and gets 50\% error, so $\fens$ is worse than $\frob$ but better than $\fstd$.
     }
	\label{fig:analysis_intuitions}
	\end{center}
	\vskip -0.2in
\end{figure*}

% \subsubsection{Shifts in spurious features}
% \ak{Maybe say something like if you're calibrated OOD then you're good. But ID calibration does not imply OOD calibration (even in practice).}
% \ar{I realized that we have never defined formally what spurious and robust features are. Ideally, this should be folded into assumption 3.1. Or you have to have a setup here in this section about the format of stylized settings you are studying and mention spurious and robust features there.}
% We first describe the different types of shifts in spurious features and their implications for calibrated ensembles. We assume there is no label shift i.e. $\Pood(Y = y) = 1/k$. 

\textbf{Missing spurious.} For our first setting, we draw inspiration from some distribution shift benchmarks. Consider Breeds Living-17~\citep{santurkar2020breeds} where the goal is to classify an image as one of 17 animal categories. The category `bear' in the ID training data contains images of black bears and sloth bears while the OOD dataset has images of brown bears and polar bears.
A standard model trained on the ID dataset might latch onto very specific features of sloth bears (for example, the presence of a shaggy mane) that are simply missing in the OOD dataset (we model this as $\fstd(x) = 0$).
A robust model could be trained to project these features out~\citep{xie2021innout}, so its predictions are still fairly reliable OOD.
% However, the robust features are still reliable OOD but potentially miscalibrated, leading to the following definition.
% \ar{What is reliable, but miscalibrated? You might want to connect with the $\alpha$ term after the definition below}
% \ak{That's a good point, I'm thinking about how to connect with $\alpha$, but a bit stuck on that}
\begin{definition}[missing spurious]
\label{dfn:missing_spurious}
  A distribution $P_0$ has missing spurious features if $\fstd(x) = 0$ almost surely for $x \sim P_0$, and there exists $\alpha \in \R^+$ such that $P_0(Y = y \mid \frob(X) = \frob(x)) = \softmax(\alpha \frob(x))_y$ for all $x \in \cX$ and $y \in [K]$.
\end{definition}
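
To see what this definition buys us (a sketch of the intuition, not the formal argument), note that on $P_0$ the ensemble coincides with the robust model almost surely, and rescaling logits by a positive constant does not change which label is largest:
\[
\fens(x) = \fstd(x) + \frob(x) = \frob(x), \qquad \arg\max_y \, \softmax(\alpha \frob(x))_y = \arg\max_y \, \frob(x)_y .
\]
So on $P_0$ the ensemble makes the same predictions as $\frob$, and $\frob$ predicts the most likely label given its own output even though it may be miscalibrated (when $\alpha \neq 1$).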

\textbf{Suppressed features.} In some datasets, such as satellite remote sensing datasets~\citep{jean2016combining,xie2021innout}, a standard model can latch onto country-specific features that may be less prevalent OOD.

\begin{definition}[suppressed features]
\label{dfn:suppressed_spurious}
  A distribution $P_\tau$ is said to have suppressed features if there exists $\tau \in \R^+$ such that $P_{\tau}(Y = y \mid f(X) = f(x)) = \softmax(\tau f(x))_y$ for all $x \in \cX$, $y \in [K]$, and $f \in \{\fstd, \frob\}$.
\end{definition}
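
The role of the shared factor $\tau$ can be seen from a short calculation. As a sketch, assume additionally that $\fstd(x) \perp \frob(x) \mid y$ under $P_\tau$ and that $P_\tau$ is class-balanced (these are assumptions of this illustration). Then, up to factors that do not depend on $y$,
\[
P_\tau(Y = y \mid \fstd(X) = \fstd(x), \frob(X) = \frob(x)) \;\propto\; \softmax(\tau \fstd(x))_y \, \softmax(\tau \frob(x))_y \;\propto\; \exp\big(\tau \, \fens(x)_y\big).
\]
Since $\tau > 0$, the most likely label is still the one with the largest ensemble logit. If the two models were instead rescaled by different factors $\tau_1 \neq \tau_2$, the exponent would become $\tau_1 \fstd(x)_y + \tau_2 \frob(x)_y$, which is in general not maximized by the same label as $\fens(x)_y$.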
% \ar{A bit funny to have the same suppression factor for both ID and OOD---is this necessary? Sorry, didn't have time to go over the arithmetic. Ideally, it should be written in a way that the robust features are not suppressed or suppressed less than the spurious features. I had a comment from before, but following up on that, why is it called ``suppressed spurious'' but the suppression seems symmetric wrt the features}
% \ak{Sorry, forgot to answer this. Yeah it needs to be the same, otherwise you can find a counter-example. But maybe some more general condition would work! Also, changed it to suppressed features.}


\textbf{Anticorrelated spurious.} In some settings, the spurious feature can be correlated with a label ID but \emph{anticorrelated} OOD. For example, in Waterbirds~\citep{sagawa2020group}, the task is to classify whether an image contains a waterbird or a landbird. In the ID dataset, waterbirds primarily appear on water backgrounds and landbirds on land backgrounds, but in the OOD dataset the backgrounds are flipped, so that landbirds appear on water backgrounds and vice versa. This motivates our final type of shift, where the spurious features (here, the background) are anticorrelated with the label OOD.

\begin{definition}[anticorrelated spurious]
  A distribution $\Padv$ is said to be \adv{} if there exist $\alpha, \beta > 0$ such that, for all $x \in \cX$ and $y \in [K]$, $\Padv(Y = y \mid \fstd(X) = \fstd(x)) = \softmax(-\beta \fstd(x))_y$ (note the minus sign), while $\Padv(Y = y \mid \frob(X) = \frob(x)) = \softmax(\alpha \frob(x))_y$.
  \end{definition}
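
A short calculation shows where the ensemble goes wrong in this setting. As a sketch, assume additionally that $\fstd(x) \perp \frob(x) \mid y$ under $\Padv$ and that $\Padv$ is class-balanced (these are assumptions of this illustration). Then, up to factors that do not depend on $y$,
\[
\Padv(Y = y \mid \fstd(X) = \fstd(x), \frob(X) = \frob(x)) \;\propto\; \exp\big(\alpha \frob(x)_y - \beta \fstd(x)_y\big).
\]
The most likely label therefore maximizes $\alpha \frob(x)_y - \beta \fstd(x)_y$, whereas $\fens(x)_y = \frob(x)_y + \fstd(x)_y$ adds the standard model's logits with the wrong sign.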

If the OOD distribution is a mixture of suppressed features and missing spurious features, then the ensemble $\fens$ is at least as accurate OOD as both the standard and the robust model.
\newcommand{\suppMissingEnsWorksText}{
If the OOD distribution is a mixture of suppressed features and missing spurious features, i.e., $\Pood = \alpha P_{\tau} + (1 - \alpha) P_0$ for some mixture weight $\alpha \in [0, 1]$, and $P_{\tau}$ and $P_0$ are class-balanced, then we have $\Errood(\fens) \leq \Errood(\frob)$ and $\Errood(\fens) \leq \Errood(\fstd)$.
}
\begin{proposition}
\label{prop:supp_missing_ens_works}
\suppMissingEnsWorksText{}
\end{proposition}

On the other hand, if the OOD distribution contains \adv{} features, then the accuracy of $\fens$ lies between the accuracies of the standard and robust models.
\newcommand{\antiCorrelatedEnsFailsText}{
If spurious features are anticorrelated OOD so that $\Pood = \Padv$, then even if $\Padv$ is class-balanced, $\Errood(\frob) \leq \Errood(\fens) \leq \Errood(\fstd)$.
}
\begin{proposition}
\label{prop:anti_correlated_ens_fails}
\antiCorrelatedEnsFailsText{}
\end{proposition}
The full proofs appear in Appendix~\ref{app:analysis_appendix}.

% We pictorially describe the different settings in Figure~\ref{fig:analysis_intuitions}. Intuitively, the missing or suppressed spurious features make the standard model less confident on points that the robust and standard models disagree on (See Figure~\ref{fig:sup_spurious}) even if the standard model is more confident overall \ar{overall is unclear...}OOD. \ar{Either do not mention relative calibration surprise in the section or connect more carefully here}.

% As a broad takeaway, in the absence of any label shift \ar{is it just absence of label shift or more stronger that we need class-balanced labels}, we expect calibrated ensembles to achieve the best OOD performance of standard and robust models if either the spurious features are missing or features are suppressed---as is common in several ``natural'' distribution shifts. However, if the spurious features are shifted adversarially (as is common in synthetic benchmarks to stress test reliance on spurious correlations), calibrated ensembles would not achieve the best OOD performance. We empirically evaluate this in Section~\ref{sec:experiments} on several real-world datasets. 

% \subsubsection{Label shifts}

% % Assuming no label shift, Proposition~\ref{prop:supp_missing_ens_works}, showed that if $\Pood$ has missing or suppressed spurious features, then the ensemble gets lower error than the standard and robust models.
% We show that even if the standard and robust model are calibrated OOD, if $\Pood$ has a different label distribution from $\Pid$, then the ensemble can do worse than the standard or robust model.
% \ar{I wouldn't start with OOD calibrated. Just state for ID calibrated ensembles and then add a sentence that this doesn't go away if we calibrate OOD} We show that even if the standard and robust model are calibrated OOD, but $\Pood$ has a different label distribution from $\Pood$, then the ensemble can do worse than the standard or robust model.

% \newcommand{\imbalancedEnsembleFailsText}{
% There exists a $\Pood$ such that $\fstd$ and $\frob$ are calibrated with respect to $\Pood$, but the ensemble can do worse than both models: $\Errood(\fens) > \Errood(\fstd)$ and $\Errood(\fens) > \Errood(\frob)$. Here $\Pood$ is not class-balanced.
% }
% \begin{example}
% \label{ex:imbalance-ensemble-fails}
% \imbalancedEnsembleFailsText{}
% \end{example}
% \ak{Explain the connection to the missing/suppressed features proposition. Also, do we need this? We don't really talk about label shift in the experiments?}

% \ar{this should be a separate paragraph or section on ``imbalanced labels'' and cannot be part of the section on label shifts. I think this should go into the appendix as a separate section and you should link to it from the place you define class-balanced distributions. Say that for clarity in presentation, in the paper, we focus on class balanced distributions. The appendix discusses different label distributions}
% We note that if $\Pid$ is not balanced, then the optimal ensemble for ID accuracy is given by Lemma~\ref{lem:bayes_prob_softmax} in Appendix~\ref{app:analysis_appendix}: $\fens(x)_y = \fstd(x)_y + \frob(x)_y - \log{\Pid(Y=y)}$ for all $y$.
% This is a more general version of Proposition~\ref{prop:calibration-ensemble-optimal}.

% Intuitively, both $\fstd$ and $\frob$ incorporate the label distribution $\log{P(Y=y)}$ in their prediction, so we subtract to prevent incorporating this term twice.
% In particular, the optimal ensemble is a function of the label distribution $\Pid(Y=y)$, and so changes if the label distribution shifts OOD.



%% \textbf{Example 1: (Suppressed spurious features)}:

%% For example, in a linear model if the spurious input is a random vector in high dimensional space then it will be nearly orthogonal to the standard model weights, and $\fstd \approx 0$.
%% \ar{Move the robust feature assumption to ahead of this subsection on missing spurious features}
%% \ak{That's a good call, although we have this split of $P_0$ and $P_{\alpha}$ so not sure how to move it ahead for both---any suggestions? If we assume the robust feature assumption for $\Pood$ as a whole, I don't think it is necessarily true for the constituents}
%% However, the robust features are still reliable OOD, and we model this by 

%% To model that the robust features are still reliable OOD, but the robust model may not be calibrated OOD, suppose that $P_0(Y = y \mid \frob(X) = \frob(x)) = \softmax(\alpha \frob(x))$ for any $\alpha \in \R^+$ and all $x \in \cX$.
%% We assume no label shift so $\Pood(Y = y) = 1/K$.

%% In this case ensembling does as well as the robust model and the standard model OOD (the inequalities are strict except for degenerate cases):
%% % \ar{Do you have the wrong inequalities}
%% % \ak{Yeah - good catch!}
%% \begin{proposition}
%% If the OOD contains a mixture of suppressed features and missing spurious features so $\Pood = \alpha P_{\tau} + (1 - \alpha) P_0$, then we have $\Errood(\fens) \leq \Errood(\frob) \leq \Errood(\fstd)$.
%% \end{proposition}

%% \textbf{Example 2 (Adversarial spurious)}:
%% In some settings, the spurious feature can be correlated with a label ID but \emph{anticorrelated} OOD.
%% For example, in Waterbirds~\citep{sagawa2020group}, the task is to classify if an image contains a waterbird or a landbird where in the ID dataset, waterbirds are primarily featured with water backgrounds and landbirds with land backgrounds, but in the OOD datasets the backgrounds are flipped.
%% A standard model trained with empirical risk minimization primarily uses the background features (because it is very easy to distinguish water and land backgrounds) and does poorly OOD.
%% In contrast, the robust model is trained via group DRO to ignore the background features.


%% % \ar{Do you have the wrong inequalities}
%% % \ak{Good catch again!}
%% \begin{proposition}
%% If spurious features are anti-correlated OOD so that $\Pood = \Padv$, then $\Errood(\frob) \leq \Errood(\fens) \leq \Errood(\fstd)$.
%% \end{proposition}

%% \ar{I find the label shift different from covarate shift (spurious/robust features etc.) I wonder if you can make this another subsection different from the previous (which come under a subsection called covariate shift)}
%% \ak{Agreed with the broader point. The above examples aren't covariate shift ($P(Y \mid X)$ can change) but they have no label shift either. So maybe some other name? Balanced vs imbalanced datasets?}
