\section{Datasets}
\label{sec:datasets}
\ak{This section needs to be compressed, maybe just one line for each dataset and intervention, and move the rest of the details to the Appendix}

We consider thirteen standard datasets, spanning multiple robustness interventions, types of shifts, and modalities (vision, language, time-series).
We first describe the robustness interventions we consider, and then describe the datasets and types of shifts.
All the datasets have been used by prior works on robustness, so we use their model checkpoints for reliable comparisons.
See Appendix~\ref{sec:more-info-datasets-appendix} for more details.
% \ak{Say all these datasets have been used by prior work and we use their checkpoints---but the purpose of this is to show the problem occurs widely}


% We run experiments on two remote sensing datasets used in prior work studying ID-OOD tradeoffs~\citep{xie2021innout}.
% These datasets consist of a core input $x$ (image data or time series data) and metadata $z$ (e.g., location, meteorological climate data). 
% The metadata is spuriously correlated with the target---using the metadata to predict labels improves accuracy in-distribution (ID), but hurts accuracy out-of-distribution.
% ~\citet{xie2021innout} consider a standard model that takes in both the core inputs and metadata to predict the target, and a robust model that only takes in the core inputs and does some additional pretraining.
% They call these the `aux-in' and `aux-out' models respectively.

\textbf{Robustness interventions}:
\begin{enumerate}
	\item In-N-Out:~\citet{xie2021innout} use domain knowledge to project out spurious features in the input, and do an additional pretraining step. They call this robust model ``aux-out'' and show that it improves accuracy OOD, but hurts accuracy ID, compared to ERM.
	% We use their datasets and model checkpoints.
% They call these the `aux-in' and `aux-out' models respectively.
	\item Lightweight fine-tuning:
	% Recent works show that tuning only parts of a pretrained model can often do better OOD even though the ID performance is worse~\citep{li2021prefix,houlsby2019parameter}.
	We take checkpoints from~\citet{kumar2022finetuning} where the standard model fine-tunes all parameters on an ID dataset, and the robust model only learns the top linear `head' layer (which does better OOD but worse ID).
	\item Zero-shot language prompting: CLIP~\citep{radford2021clip} is a multi-modal model that can predict the label of an image by comparing the image embedding with the language embeddings of prompts such as `photo of an apple'. \citet{radford2021clip} show that this zero-shot language prompting approach (robust model) is more accurate OOD than fine-tuning the entire model (standard model), although the ID accuracy of the robust model is worse.
	\item Group distributionally robust optimization (DRO)~\citep{sagawa2020group}: Standard ERM models often latch on to spurious correlations in a dataset, such as image background color, or the occurrence of certain words in a sentence. Group DRO essentially upweights examples where this spurious correlation is not present.
	\item CORAL~\citep{sun2016deep} aims to align feature representations across different domains, by penalizing differences in the means and covariances
	of the feature distributions.
	The hope is that this generalizes better to OOD domains.
	% The original formulation in~\citet{sagawa2020group} assumes the spurious correlations are annotated, but newer variants~\citep{liu2021jtt} can work even without these annotations.
 \end{enumerate}
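
As a rough sketch of three of the interventions above, the underlying objectives can be written as follows. The notation is introduced here for exposition only, and normalization constants and penalty weights vary across implementations, so these should be read as illustrative forms rather than the exact losses used in the cited checkpoints. Let $\ell(\theta; x, y)$ denote the loss of a model with parameters $\theta$ on an example $(x, y)$. Zero-shot language prompting with an image encoder $f$, a text encoder $g$, and one prompt $t_k$ per class $k$ predicts
\[
	\hat{y}(x) = \arg\max_{k} \; \cos\bigl(f(x),\, g(t_k)\bigr).
\]
Group DRO, given a set of groups $\mathcal{G}$ with group distributions $P_g$, minimizes the worst-group expected loss,
\[
	\min_{\theta} \; \max_{g \in \mathcal{G}} \; \mathrm{E}_{(x, y) \sim P_g} \bigl[ \ell(\theta; x, y) \bigr],
\]
which in practice places more weight on the groups with the highest current loss. A CORAL-style objective for two training domains $a$ and $b$, with feature means $\mu_a, \mu_b$ and feature covariances $\Sigma_a, \Sigma_b$ (all depending on $\theta$), adds an alignment penalty with an assumed weight $\lambda \geq 0$ to the average training loss:
\[
	\min_{\theta} \; \mathrm{E}\bigl[ \ell(\theta; x, y) \bigr] + \lambda \Bigl( \| \mu_a - \mu_b \|_2^2 + \| \Sigma_a - \Sigma_b \|_F^2 \Bigr).
\]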

We consider three types of \natshifts{} (geography shifts, subpopulation shifts, style shifts), and we also consider adversarially synthesized ``anticorrelated'' spurious shifts.

\textbf{Geography shifts.} In geography shifts, the ID data comes from one set of locations, and the OOD data comes from a different set of locations. One motivation is that in many developing areas training data may be unavailable because of monetary constraints~\citep{jean2016combining}.
\begin{enumerate}
	\item \textbf{LandCover}~\citep{russwurm2020meta}: The goal is to classify a satellite image into one of 6 land types (e.g., ``grassland'', ``savannas''). The ID data contains images from outside Africa, and the OOD data consists of images from Africa.~\citet{xie2021innout} use the In-N-Out intervention.
	\item \textbf{Cropland}~\citep{wang2020weakly}: The goal is to predict whether a satellite image is of a cropland or not. The ID dataset contains images from Iowa, Missouri, and Illinois, and the OOD dataset contains images from Indiana and Kentucky.~\citet{xie2021innout} use the In-N-Out intervention.
	\item \textbf{iWildCam}~\citep{beery2020iwildcam,koh2021wilds}: The goal is to classify the species of an animal given a photo taken by a camera placed in the wild. The ID dataset consists of photos taken by over 200 cameras, and the OOD dataset consists of photos taken by held-out cameras placed in different locations.~\citet{koh2021wilds} use the CORAL intervention.
\end{enumerate}
\ak{Optionally move the discussion of what's core and spurious to the Appendix}

\textbf{Subpopulation shifts.} In subpopulation shifts, the ID data contains a few sub-categories (e.g., black bears and sloth bears), and the OOD data contains different sub-categories (e.g., brown bears and polar bears) of the same parent category (e.g., bears). For both datasets below,~\citet{kumar2022finetuning} use the lightweight fine-tuning intervention.
\begin{enumerate}
	\item \textbf{Living-17}~\citep{santurkar2020breeds}: the goal is to classify an image as one of 17 animal categories such as ``bear'', where the ID and OOD datasets have different species of bears. 
	\item \textbf{Entity-30}~\citep{santurkar2020breeds}: similar to Living-17, except the goal is to classify an image as one of 30 entity categories such as ``food'', ``motor vehicle'', and ``insect''.
\end{enumerate}
\ak{Optionally move the discussion of what checkpoints we used to the Appendix.}

\textbf{Style shifts.} In style shifts, the ID data has a certain style (e.g., sketches), and the OOD data has a different style (e.g., real photos, renditions). 
\begin{enumerate}
	\item \textbf{DomainNet}~\citep{peng2019moment}: a standard domain adaptation dataset. Here, our ID dataset contains ``sketch'' images (e.g., drawings of apples, elephants, etc), and the OOD dataset contains ``real'' photos of the same categories.~\citet{kumar2022finetuning} use the lightweight fine-tuning intervention.
	\item \textbf{CelebA}~\citep{liu2015deep}: the goal is to classify a portrait of a face as ``male'' or ``female''. The ID dataset contains images of people without hats, and the OOD dataset contains images of people wearing hats (with hats, some facial features might be ``suppressed'' or ``missing'').~\citet{xie2021innout} use the In-N-Out intervention.
	\item \textbf{CIFAR-10 $\to$ STL}: a standard domain adaptation dataset~\citep{french2018selfensembling}, where the ID dataset is CIFAR-10~\citep{krizhevsky2009learningmultiple}, and the OOD dataset is STL~\citep{coates2011stl10}. The task is to classify an image into one of 10 categories such as ``dog'', ``cat'', or ``airplane''.~\citet{kumar2022finetuning} use the lightweight fine-tuning intervention.
	\item \textbf{ImageNet}~\citep{russakovsky2015imagenet}: a large-scale dataset where the goal is to classify an image into one of 1000 categories.~\citet{radford2021clip} use the zero-shot language prompting intervention. We evaluate on 3 standard OOD datasets: \textbf{ImageNetV2}~\citep{recht2019doimagenet}, \textbf{ImageNet-R}~\citep{hendrycks2020many}, and \textbf{ImageNet-Sketch}~\citep{wang2019learningrobust}.
\end{enumerate}

\textbf{\Advshifts{}.} In these adversarially synthesized shifts, the ID dataset contains a feature that is correlated with a label, but this correlation is flipped OOD.~\citet{jones2021selective} use the group DRO intervention.
% For example, waterbirds is explicitly constructed so that ``water'' backgrounds are correlated with ``waterbird'' labels in the ID, but anti-correlated OOD.
\begin{enumerate}
	\item \textbf{Waterbirds}~\citep{sagawa2020group}: The goal is to classify an image as a ``waterbird'' or ``landbird''. The dataset is synthetically constructed to have \adv{} features: ``water'' backgrounds are correlated with ``waterbird'' labels in the ID, but anticorrelated OOD.
	\item \textbf{MNLI}~\citep{williams2018broad}: The goal is to predict whether a hypothesis is entailed by, contradicted by, or neutral to an associated premise.~\citet{sagawa2020group} partition the dataset so that ``negation'' words are correlated with the contradiction label ID, but anticorrelated with the contradiction label OOD.
	\item \textbf{CivilComments}~\citep{borkan2019nuanced}: The goal is to predict whether a comment is toxic or not.~\citet{jones2021selective} partition the dataset so that in the ID split mentions of a Christian identity are correlated with non-toxic comments, but in the OOD split mentions of a Christian identity are correlated with toxic comments. CivilComments is also used in~\citet{koh2021wilds}.
\end{enumerate} 
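
One way to state the anticorrelated structure of these three datasets more formally is the following sketch, with notation introduced here for illustration only. Let $s \in \{0, 1\}$ indicate whether the spurious feature is present (e.g., a ``water'' background, or a ``negation'' word), and let $y \in \{0, 1\}$ indicate the associated label (e.g., ``waterbird'', or contradiction). The shift is anticorrelated when the sign of the correlation between $s$ and $y$ flips between the two distributions:
\[
	\mathrm{Corr}_{\mathrm{ID}}(s, y) > 0
	\qquad \mbox{and} \qquad
	\mathrm{Corr}_{\mathrm{OOD}}(s, y) < 0 .
\]
Under this condition, a model that leans on $s$ may gain accuracy ID while losing accuracy OOD.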
\ak{Technically, the group DRO works look at worst-case accuracy over various groups. The OOD I'm describing is what gets selected as the worst group.}

% We run experiments spanning three different types of robustness interventions: projecting out spurious metadata, language prompting, and freezing pretrained features.
% These experiments span multiple data modalities and model architectures.

% \subsection{Spurious metadata}

% We run experiments on two remote sensing datasets used in prior work studying ID-OOD tradeoffs~\citep{xie2021innout}.
% These datasets consist of a core input $x$ (image data or time series data) and metadata $z$ (e.g., location, meteorological climate data). 
% The metadata is spuriously correlated with the target---using the metadata to predict labels improves accuracy in-distribution (ID), but hurts accuracy out-of-distribution.
% ~\citet{xie2021innout} consider a standard model that takes in both the core inputs and metadata to predict the target, and a robust model that only takes in the core inputs and does some additional pretraining.
% They call these the `aux-in' and `aux-out' models respectively.

% \paragraph{Cropland.} The goal is to predict whether a satellite image is of a cropland or not. The core input $x$ is an RGB satellite image, and the metadata $z$ consists of location coordinates and vegetation bands. The original dataset is from~\citet{wang2020weakly}, and we use U-net model checkpoints from~\citet{xie2021innout}.

% \paragraph{Landcover.} The goal is to predict the land type from satellite data at a given location. Here, the core input $x$ is a time series measured by NASA's MODIS satellite~\citep{modis2015landcover}, and $z$ is climate data (e.g., temperature) at that location. The dataset is from~\citet{gislason2006landcover, russwurm2020meta}. We use model checkpoints from~\citet{xie2021innout} where they use 1D convolutions for time series data.

% \subsection{Zero-shot language prompting}

% ~\citet{radford2021clip} (CLIP) pretrain a model on a large multi-modal language and vision dataset.
% The model can then predict the label of an image by comparing the image embedding, with the language embedding for prompts such as `photo of an apple' or `photo of a banana'.
% They show that this zero-shot language prompting approach can be much more accurate out-of-distribution than the traditional method of fine-tuning the entire model.

% \paragraph{ImageNet $\to$ ImageNet-R.}
% We use a CLIP vision transformer, specifically a ViT-B/16, which is the best publicly available model.
% The robust model uses language prompts to make zero-shot predictions on ImageNet-Renditions~\citep{hendrycks2020many}, a dataset containing cartoon, graffiti, video game, etc, renditions of ImageNet classes.
% The standard model initializes with weights from the CLIP model, and fine-tunes on ImageNet~\citep{russakovsky2015imagenet} training data for 10 epochs with a batch size of 64, initial learning rate of 0.0001 with a cosine learning rate decay, before making predictions on ImageNet-R.
% We note that the robust model gets 10\% lower accuracy ID (on ImageNet validation examples), but gets 30\% higher accuracy OOD (on ImageNet-R test examples)

% \subsection{Freezing pretrained features}

% When adapting a pretrained model to an ID dataset, typically all the model parameters are fine-tuned.
% Recent work looks at `lightweight' fine-tuning, where only parts of the model are adapted---this can often do better OOD even though the ID performance is worse~\citep{li2021prefix,houlsby2019parameter}.
% We consider three distribution shift datasets where the standard model starts from a pretrained initialization and fine-tunes all parameters on an ID dataset, and the robust model only learns the top linear `head' layer.

% \paragraph{DomainNet.} A standard domain adaptation dataset~\citep{peng2019moment}. Here, our ID dataset contains `sketch' images (e.g., drawings of apples, elephants, etc), and the OOD dataset contains `real' photos of the same categories. We use the version of the dataset from~\citet{tan2020coal}. We start from a CLIP pretrained ResNet50 and either fine-tune for 50 epochs with batch size 64 and learning rate 0.001 with cosine learning rate decay (to get a standard model) or train the head layer using sklearn logistic regression (to get a robust model).

% \paragraph{CIFAR-10 $\to$ STL.} Another standard domain adaptation dataset~\citep{french2018selfensembling}, where the ID is CIFAR-10~\citep{krizhevsky2009learningmultiple}, and the OOD is STL~\citep{coates2011stl10}. We start from a ResNet50 pretrained on unlabeled ImageNet examples using MoCo-v2~\citep{chen2020improved} and either fine-tune for 20 epochs with a batch size of 64 and learning rate of 0.001 with cosine learning rate decay (to get a standard model) or train the head layer using sklearn logistic regression (to get a robust model).

% \paragraph{Living-17.} Part of the BREEDS benchmark~\citep{santurkar2020breeds}, here the goal is to classify an image as one of 17 animal categories such as `bear'---the ID dataset contains images of black bears and sloth bears and the OOD dataset has images of brown bears and polar bears. We start from a ResNet50 pretrained on unlabeled ImageNet examples using MoCo-v2~\citep{chen2020improved} and either fine-tune for 20 epochs with batch size 64 and learning rate of 0.001 with cosine learning rate decay (to get a standard model) or train the head layer using sklearn logistic regression (to get a robust model).

