\section{Setup}
\label{sec:setup}

Consider a $K$-class classification task, where the goal is to predict labels $y \in [K]$ corresponding to inputs $x \in \cX$.

\textbf{Models.}
A model $f: \cX \to \R^K$ takes an input $x \in \cX$ and outputs a vector of scores $f(x) \in \R^K$, where $f(x)_i$ can be interpreted as the model's ``confidence'' that the label $y$ is $i$.
The model predicts the label $\pred(f(x)) = \argmax_i f(x)_i$.
The confidence scores can be normalized to sum to $1$ (and interpreted as probabilities) using the softmax function, $\softmax(f(x))_i = \frac{\exp(f(x)_i)}{\sum_{j=1}^K \exp(f(x)_j)}$ for $i \in [K]$.
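For illustration, consider a hypothetical input with $K = 3$ and scores $f(x) = (2, 1, 0)$ (these values are chosen only for this example). Since $e^{2} + e^{1} + e^{0} \approx 11.107$, we get
\begin{align*}
\softmax(f(x)) \approx (0.665,\ 0.245,\ 0.090), \qquad \pred(f(x)) = 1.
\end{align*}
Because the exponential is increasing and all entries share the same normalizer, softmax preserves the ordering of the scores, so the predicted label is the same whether the $\argmax$ is taken over $f(x)$ or over $\softmax(f(x))$.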

\textbf{Distributions and error.}
Let $\Pid$ and $\Pood$ denote the underlying distributions of $(x, y)$ pairs in-distribution (ID) and out-of-distribution (OOD), respectively.
We evaluate a model $f$ by its misclassification rate under $\Pid$ and $\Pood$: $\Errid(f) = \E_{x, y \sim \Pid}[ \pred(f(x)) \neq y]$ and $\Errood(f) = \E_{x, y \sim \Pood}[ \pred(f(x)) \neq y]$.
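In practice these expectations cannot be computed exactly. A common estimate, sketched here under the assumption that we have $n$ held-out test examples $(x_1, y_1), \dots, (x_n, y_n)$ drawn i.i.d.\ from the relevant distribution (the symbol $n$ and these samples are notation introduced only for this illustration), is the empirical misclassification rate
\begin{align*}
\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\pred(f(x_i)) \neq y_i],
\end{align*}
where $\mathbf{1}[\cdot]$ is $1$ if its argument holds and $0$ otherwise. For instance, if a model misclassified $80$ of $n = 1000$ test examples (hypothetical numbers), the estimated error would be $80 / 1000 = 0.08$.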
% \ar{I think we can just define $\Errid$ and $\Errood$ directly... feels easier that going via the abstract $\Err(P, \cdot)$ and then also compressing the notation}
% \ak{done}
% \ar{How exactly are the standard and robust models defined? I think standard should be defined as ERM model and robust model is any model that performs better than the ERM model OOD? Right now, it's unclear and confusing}
% \ak{added}

\textbf{Standard and robust models.}
A standard model $\fstd$ is trained via empirical risk minimization \pl{(ERM)}, that is, by minimizing some loss on ID training data.
The standard model $\fstd$ often relies \pl{/might rely} on spurious correlations \pl{correlation between what and what} between the label and input features, such as the image background or the occurrence of certain words, that are not necessarily predictive OOD.\ak{TODO: add cites}
% Hence standard models often perform poorly OOD.
% In order to improve OOD performance, the training process needs to be changed (robustness interventions) to discourage models from relying on ID-specific spurious features.
% We call such models $\frob$, where the exact robustness intervention depends on the task. 
To improve OOD performance, a robust model $\frob$ is trained via a modified training procedure (a robustness intervention) that discourages the model from relying on ID-specific spurious features.
Formally, we assume the following relationship between $\fstd$ and $\frob$:
\begin{align}
\Errid(\fstd) \leq \Errid(\frob); ~~\Errood(\frob) \leq \Errood(\fstd). 
\end{align}
\pl{this is assuming infinite data...I think in general we need to be clear about this that we're not thinking about generalization?}
\ak{Trying to understand this better---are you saying the standard and robust models might not satisfy this property with finite data? E.g., in In-N-Out if we have very little data, robust model can do better since it uses fewer features? We're just taking this tradeoff as a given though here. Agreed that we should say we don't really look at finite samples in the analysis.}
The precise robustness intervention depends on the task. In Section~\ref{sec:analysis}, we model the relationship between $\fstd$ and $\frob$ in a stylized setting amenable to analysis, and in Section~\ref{sec:datasets}, we describe what $\fstd$ and $\frob$ are for our real datasets.
% We call such models $\frob$, where the exact robustness intervention depends on the task. 
\ar{Is there an easy example to give here?InNOut and LP/FT both seem a bit difficult to explain intuitively?}
\ak{Thinking---maybe I can just say by projecting out spurious features}

\textbf{Best of both worlds.} Our goal is a classifier $\fens$ that matches or improves on the ID error of the standard model and the OOD error of the robust model:\pl{I think this wording 'strong ID accuracy of the standard model' is a bit confusing...I'd just say we want an $f$ that achieves...}
\begin{align}
\Errid(\fens) \leq \Errid(\fstd); ~~\Errood(\fens) \leq \Errood(\frob). 
\end{align}
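As a purely hypothetical illustration (these numbers are not results from any experiment), suppose
\begin{align*}
\Errid(\fstd) = 0.10, \quad \Errid(\frob) = 0.15, \quad \Errood(\fstd) = 0.40, \quad \Errood(\frob) = 0.30.
\end{align*}
Here the standard model is better ID ($0.10 \leq 0.15$) and the robust model is better OOD ($0.30 \leq 0.40$). The goal above would then require $\Errid(\fens) \leq 0.10$ and $\Errood(\fens) \leq 0.30$ simultaneously, which neither $\fstd$ nor $\frob$ achieves on its own.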

% \begin{equation}
% 	\Err(f, P) = \E\limits_{x, y \sim P}[ \pred(f(x)) \neq y]
% \end{equation}
% where $\Errid(f) = \Err(f, \Pid)$ and $\Errood(f) = \Err(f, \Pood)$.
% \Err(f, P) = \E\limits_{x, y \sim \Pid}[ \pred(f(x)) \neq y]\mbox{, and }\Errood(f) = \E\limits_{x, y \sim \Pood}[ \pred(f(x)) \neq y],
% , or fraction of misclassifications on \ar{samples from} a distribution. Formally, for a distribution $P$, we have $\Err(P, \cdot) = \E\limits_{x, y \sim P}[ \pred(f(x)) \neq y]$. Let $\Pid$ and $\Pood$ denote the underlying distribution of $(x, y)$ pairs in-distribution (ID) and out-of-distribution (OOD), respectively.
% In this work, we evaluate models on $\Err(\Pid, \cdot)$ and $\Err(\Pood, \cdot)$ and $\Pood$, denoted by $\Errid$ and $\Errood$ respectively. We measure $\Errid$ and $\Errood$ on held-out test sets drawn from $\Pid$ and $\Pood$ respectively. 
% 
% We have a standard model $\fstd$ and a robust model $\frob$, with a tradeoff: the standard model typically does better ID while the robust model does better OOD: $\Errid(\fstd) > \Errid(\frob)$ but $\Errood(\fstd) < \Errood(\frob)$---our goal is to get the best of both and produce a model that does well both ID and OOD. \ar{Make precise..? Setup should be precise - this sounds like an intro; Either remove this whole part and just have the ID and OOD errors without going into standard and robust models, or make everything more concrete by grounding in terms of ERM}


\textbf{ID validation data.}
% We have training data from $\Pid$, $\{(\xtrain_i, \ytrain_i)\}_{i=1}^{\ntrain} \sim \Pid$.
% \ak{@Aditi, I know you suggested having training data, but I thought about it more and commented this out because we don't use the training data anywhere. Is that OK?}
To get the best of both worlds, we allow access only to ID validation data, $\{(\xval_i, \yval_i)\}_{i=1}^{\nval} \sim \Pid$, which we use for tuning hyperparameters.
Following~\citet{xie2021innout,koh2021wilds,gulrajani2020search}, we do \emph{not} use any OOD validation data.
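One way such tuning could look, sketched under assumptions: let $\Lambda$ be a hypothetical finite set of candidate hyperparameter values, and let $f_\lambda$ denote the classifier obtained with hyperparameter $\lambda \in \Lambda$ (neither symbol is defined elsewhere in this section). A natural choice is the value with the lowest ID validation error,
\begin{align*}
\hat{\lambda} \in \operatorname*{arg\,min}_{\lambda \in \Lambda} \frac{1}{\nval} \sum_{i=1}^{\nval} \mathbf{1}[\pred(f_\lambda(\xval_i)) \neq \yval_i],
\end{align*}
where $\mathbf{1}[\cdot]$ is $1$ if its argument holds and $0$ otherwise. This selection rule touches only samples from $\Pid$; no sample from $\Pood$ enters it.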
% ---this captures the setting where it is difficult to predict what kind of distribution shifts occur when a model is deployed. 
\ak{Do we need to explain why we don't have OOD data?}
% \ar{Be consistent across period or colon after titles. I think currently it reflects whether you wrote something or i did :) }
% \ak{Changed all to periods :)}
% \ar{I think active voice would be stronger... in this work, we only allow access to ID validation set and maybe take the chance to say that most other works assume some access to OOD information---either validation set or unlabeled data?}
% \ak{Changed to active voice! I agree with you that not using OOD data is a strength of our work, but I want to be careful that we don't confuse the reader, and be mindful there are works that only use ID data like In-N-Out. Any suggestions on what to add here?}

\pl{but now we are in the finite sample regime?}

\pl{I think it'd be clarifying to specify what fstd and frob and fens can depend on...because we don't talk about the training set}

% In addition, we have a validation set $\{(\xval_i, \yval_i)\}_{i=1}^{\nval} \sim \Pid$ that can be used to tune hyperparameters.  Note that the validation set is exclusively from $\Pid$ since OOD data is typically unavailable.
% and we do \emph{not} use any OOD validation data---this captures the practical setting where we cannot accurately predict when and and what kind of distribution shifts occur when a model is deployed. 
