\section{Related problems}

\vspace{-0.2cm}
\paragraph{Predictive uncertainty and Bayesian methods.} A key appeal of
the Bayesian framework is that it directly provides uncertainty estimates
alongside the predictions. Bayesian methods are
particularly useful under covariate shift \cite{Shimodaira2000}, where
predictive uncertainty can be used to abstain \cite{Geifman2017} on
ambiguous samples while still making predictions on high-certainty ones. Approaches
like MC-Dropout \cite{gal2016} or Deep Prior Networks \cite{dpn} aim to
provide such uncertainty estimates, but these estimates are often
inaccurate on OOD samples \cite{ood_ovadia}. The same problem has been observed
for Bayesian Neural Networks \cite{neal1995, graves2011, blundell2015}, for
which sampling efficiently from the posterior over parameters remains
challenging for large models \cite{ood_ovadia}.

\vspace{-0.3cm}
\paragraph{Transductive learning.} Transductive learning \cite{vapnik} assumes
that the unlabeled test set is available together with a labeled training set
and that both can be used to select a good predictor. Unlike semi-supervised
learning, the transductive framework is only concerned with performing well on
the given test set and does not aim to generalize to held-out data. In
practice, it has been used successfully for problems such as zero-shot learning
\cite{zsl_1, zsl_2}. Transductive OOD detection \cite{scott08} is equivalent to
the scenario that we adopt in this paper if the unlabeled set coincides with the
test set used for evaluation (see also
Appendix~\ref{sec:appendix_transductive_results}).

% Moreover, in \cite{scott08} the authors proposed to solve
% the transductive anomaly detection problem by discriminating between the
% training and the test set and constraining the false positive rate using a
% validation set of ID samples (see Appendix~\ref{sec:appendix_discriminator}).
% However, this method does not provide an explicit estimate of the predictive
% uncertainty of a classifier trained on the labeled training set. In contrast,
% $\method$ can be used for prediction, and the disagreement statistic is an
% indication of the ensemble's confidence.

\vspace{-0.3cm}
\paragraph{Domain adaptation.} The OOD detection problem with access to unknown
OOD data is reminiscent of unsupervised domain adaptation (UDA)
\cite{bendavid2007, ganin2016, da_chen}, in that both use an
unlabeled data set to adapt predictors to a new distribution. However, unlike
OOD detection, UDA aims to provide correct predictions on a target distribution
with covariate shift \cite{Shimodaira2000}. Hence, the UDA problem is ill-posed
if the target distribution contains data from novel classes that are not present
in the source set. In such novel-class scenarios, one needs to consider OOD
detection instead.
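
As a brief sketch of why novel classes break the UDA setting (the symbols $P_S$,
$P_T$ and $\mathcal{Y}_S$ are illustrative notation assumed here, not defined
elsewhere in the paper), covariate shift is usually formalized as a change in
the marginal over inputs with a shared labeling conditional:
\[
  P_S(x) \neq P_T(x), \qquad P_S(y \mid x) = P_T(y \mid x),
\]
where $P_S$ and $P_T$ denote the source and target distributions. If the target
contains a class $y \notin \mathcal{Y}_S$, with $\mathcal{Y}_S$ the set of
source labels, then $P_S(y \mid x) = 0$ wherever it is defined, while
$P_T(y \mid x) > 0$ for some inputs $x$, so the shared-conditional assumption
cannot hold on such samples.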

% \vspace{-0.2cm} \paragraph{Ensemble diversity.} Previous works have
% highlighted limitations of ensemble-based OOD detection approaches. Firstly,
% neural networks tend to make overconfident predictions even on test samples
% far from the training distribution \cite{hein}. Moreover, the stochasticity of
% conventional neural network training approaches (e.g.\ random initialization,
% SGD) is not sufficient to obtain ensembles that are diverse enough to give
% good uncertainty estimates on OOD samples \cite{fourier}. Some recent works
% address this problem by adding explicit regularizers that incentivize model
% diversity, either in a transductive setting, like MCD \cite{mcd_ood}, or
% inductively \cite{bahng2019}. However they do not manage to detect OOD samples
% well in hard scenarios. 


% In the case of covariate-shift we would like to use the predictive uncertainty
% to be able to 
% % distinguish which of the OOD data the model can still correctly extrapolate
% % on. The goal is to
% abstain \cite{Geifman2017} on ambiguous samples, on which the model cannot
% extrapolate reliably. For this problem, the goal is to have a low predictive
% error on the samples on which we do not abstain, while still maintaining a high
% coverage on the target data.
% % For this setting, the task boils down to finding a good model of predictive
% % uncertainty for the given combination of distribution over covariates $P_X$
% % and ground truth labeling function.
% Bayesian methods \cite{neal1995, gal2016, dpn} and approaches such as Vanilla
% Ensembles \cite{balaji} are tailored to solving this class of problems.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "iclr2021_conference"
%%% End:
