% \vspace{-0.2cm}
\section{Related work}
\label{sec:setting}


% \vspace{-0.2cm}
In this section, we give an overview of the types of related methods that are,
in principle, applicable to semi-supervised novelty detection. In particular,
we point out the caveats of these methods, categorizing them with respect to
1) data availability and 2) the surrogate objective they try to optimize. This
taxonomy may also be of independent interest for navigating the zoo of ND
methods. We list a few representative approaches in
Table~\ref{table:taxonomy} and refer the reader to surveys such as
\citet{Bulusu2020} for a thorough literature overview.


% \vspace{-0.2cm}
\subsection{Taxonomy according to data availability}

%% \fy{SSND has unlabeled dataset available. One way is to use them - SSND methods (ours),
%%   other is to ignore them (UND) - don't work. Then there are some in the gray zone
%%   they use some labeled data or prior knowledge related to OOD: OE (synthetic, different),  OOD for hyperparam. tuning, related labeled data during pre-training (fort)}

% \vspace{-0.2cm} In this section we present methods that are ordered by the
% amount of labeled OOD data that they require. 

In this section, we present related novelty detection methods that use varying
amounts of labeled OOD data for training.
% \alex{could be confusing re SSND} In what follows, the terms regarding
% (un/semi/)supervised refer to labeling of OOD data.
We call \emph{test OOD} the novel-class data that we want to detect at test
time.

% \fy{I would try to bold + emph them instead of having paragraphs, the paragraph way
%   of writing encourages the wrong style I think}
% \fy{i.e. instead of writing we call method bla if you do this (sounds too textbook-y) i would go with
%   you can do this and we call this type bla}
% 
% \paragraph{Methods for SSND without additional labeled data} \fy{as we've seen
% SSND on NN they don't work, reason is not diverse enough}
%% Several prior works propose methods
%% tailored to the SSND setting \citep{blanchard10, mari2010, duPlessis14,
%% Liu18}. Despite having the advantage that they can adapt to emerging OOD data,
%% these approaches perform poorly on near OOD data sets.

%\paragraph{Unsupervised novelty detection (UND).}
In a scenario like the one in Figure~\ref{fig:practical_sketch},
one can apply \boldemph{unsupervised novelty detection (UND)} methods
that ignore the unlabeled batch and only use ID data during training
\citep{balaji, gram_ood, nalisnick}. However, these approaches lead to
poor novelty detection performance, especially on near OOD data.

Some methods aim to improve UND performance by using additional data.
% \paragraph{Requires additional data similar to test OOD to work well} \fy{Gray
% zone: There are some methods that try to improve UND using additional data }
For example, during training one may use synthetically generated outliers (e.g.\
\citet{Tack2020, sohn2021}) or a different OOD data set that may be available (e.g.\ OE
and DPN use TinyImages) with samples
\emph{known to be outliers}. 
% You can train to explicitly label outliers or not (contrastive1).
However, for these \boldemph{augmented unsupervised ND (A-UND)} methods to
work, the OOD data used for training must be similar to the test OOD samples.
When this condition is not satisfied, A-UND performance deteriorates
drastically (see Table~\ref{table:main_results}). Since novel data is, by
definition, unknown, the only information about the OOD data that is
realistically available is in the unlabeled set, as in SSND. Therefore, it is
unclear a priori what an appropriate choice of training OOD data is for A-UND
methods.
% similar data to test OOD is either already known and labeled ID, or unlabeled
% and hence unknown so that these methods are not feasible.

Another line of work uses pretrained models to incorporate additional data that
is close to test OOD samples, which we call \boldemph{pretrained UND (P-UND)}.
% By having classes that are similar to test OOD (give examples for CIFAR100),
% the NN learns better features and has better density estimate there
% ...\at{TODO} \fy{whats the intuition?}
\citet{fort2021} use large transformer models pretrained on ImageNet21k and
achieve good near OOD detection performance when ID and OOD data are similar to
ImageNet samples (e.g.\ CIFAR10/CIFAR100). However, our experiments in
Appendix~\ref{sec:appendix_vit} reveal that this method performs poorly on all
other near OOD data sets, including unseen FashionMNIST or SVHN classes and
X-rays of unknown diseases.  This unsatisfactory performance is apparent when ID
and OOD data
% (e.g.\ X-rays of known and novel diseases, \fy{but even others? SVHN CIFAR10})
do not share visual features with the pretraining data (i.e.\ ImageNet21k).
% In fact, this method shows unsatisfactory performance on all other near OOD
% data sets, including unseen FashionMNIST or SVHN classes and X-rays of unknown
% diseases.
%% when ID and OOD data are similar (e.g.\ X-rays of
%% known and novel diseases), yet they do not share visual features in common with
%% the pretraining data (i.e.\ ImageNet21k).
%% We conjecture that the pretraining
%% data helps to learn features that separate well CIFAR10 classes from CIFAR100,
%% but the model projects all X-rays to a concentrated region in feature space,
%% making them difficult to distinguish with only weak supervision from the ID
%% labels.
Since collecting such large troves of ``similar'' data for pretraining is often
not possible in practical applications (e.g.\ medical imaging), the
applicability of their method is rather limited.
% and training massive ML models like the ones used in \citet{fort2021} comes
% with great concerns about the impact on the environment.

%% \fy{whats difference between this and ours - by taking the flagged as OOD
%% in ``transductive phase'' and can use those as different OOD?}

Furthermore, a few popular methods use test OOD data for calibration or
hyperparameter tuning \citep{mcd_ood, mahalanobis, odin, Ruff2020}, which is not
feasible in practice. Clearly, knowing the test OOD distribution a priori
turns the problem into \boldemph{supervised ND (SND)} and hence violates the
fundamental assumption that OOD data is unforeseeable.

As we have already seen, current \boldemph{SSND} approaches (e.g.\ MCD, nnPU)
perform poorly for complex models such as neural networks. We note that SSND is
similar to using unlabeled data for learning with augmented classes (U-LAC)
\citep{Da2014, guo20, yujie2020} and is related to transductive novelty
detection \citep{scott08, guo20}, where the test set coincides with the
unlabeled set used for training.

%% \paragraph{Augmented unsupervised ND (A-UND).} Alternatively, we call
%% \emph{augmented unsupervised ND} (A-UND) the setting in which a method uses OOD
%% samples that are different from the ones seen at test time. The training OOD
%% data may consist of different but real, known outliers \citep{outlier_exposure,
%% dpn} (\emph{different OOD}) or OOD-like synthetically generated samples to
%% simulate the test OOD data. However, the performance of these methods relies
%% heavily on how similar the proxy OOD data is to test OOD, which makes it
%% cumbersome to thoroughly evaluate and analyze A-UND methods. For instance, DPN
%% and OE both use TinyImages as training OOD data and perform competitively when
%% CIFAR10 (which a subset of TinyImages) is the test OOD set, but are surpassed by
%% other UND baselines when the test OOD data is SVHN (see
%% Table~\ref{table:main_results}). Since novel data is, by definition,
%% unpredictable, it is unknown a priori what a good choice of training OOD data
%% is.  In contrast, SSND methods can adapt to novel data that emerges (and can
%% change) over time, instead of using past OOD data to detect new outliers.

%% \paragraph{Pretrained UND (P-UND).} A couple of concurrent works
%% \citep{fort2021, reiss2021} propose to use pretrained classifiers for
%% novelty detection.
%% In particular, \citet{fort2021} use models pretrained on a large data set
%% (ImageNet21k) which
%% contains many labeled samples that are similar to the unseen CIFAR classes used
%% as OOD data for evaluation. While the method achieves good near OOD detection
%% performance, collecting such large troves of relevant data is often not possible
%% in practical applications (e.g.\ medical imaging) and tuning massive ML models
%% like the ones used in \citet{fort2021} comes with great concerns about the
%% impact on the environment.

%% \paragraph{Supervised ND (SND).} A number of recent approaches use labeled OOD
%% samples from the test distribution for calibration or hyperparameter tuning
%% \citep{mcd_ood, mahalanobis, odin, Ruff2020}. However, knowing the test
%% OOD distribution a priori violates the fundamental assumption that OOD data is
%% unforeseeable.

% \vspace{-0.2cm}
\subsection{Taxonomy according to probabilistic perspective}

Apart from data availability, the methods that one can use in a practical SSND
scenario differ in the principle, grounded in a probabilistic model, that they
implicitly or explicitly rely on. For example, novel-class samples are a subset
of the points that are out-of-distribution in the literal sense, i.e.\
$\iddist(x) < \alpha$ for a small constant $\alpha > 0$. One can hence
\textbf{learn $\iddist$} from unlabeled ID data, which is, however, notoriously
difficult in high dimensions.
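As a minimal sketch of this principle (the symbols $\hat{p}$ and $g_\alpha$ are
illustrative notation introduced only here), suppose $\hat{p}$ is some density
estimate of $\iddist$ fit on the ID inputs. The resulting detector flags $x$ as
novel whenever it falls below the $\alpha$-level of the estimate:
\[
  g_\alpha(x) = \mathbf{1}\{\hat{p}(x) < \alpha\}.
\]
Such a detector can only be as good as the approximation of $\iddist$ by
$\hat{p}$ in low-density regions, which is exactly the estimation problem that
becomes difficult in high dimensions.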

Similarly, from a Bayesian viewpoint, the predictive variance
is larger for OOD samples with $\iddist(x)<\alpha$. Hence, one could instead
compute the posterior $\iddist (y|x)$ and flag points with large variance (i.e.\
high \textbf{predictive uncertainty}). This circumvents the problem of
estimating $\iddist$. However, Bayesian uncertainty estimates that accompany
NN predictions tend to be inaccurate on OOD data \citep{ood_ovadia}, resulting in poor novelty detection performance.
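One hedged way to write such an uncertainty score, assuming access to $M$
approximate posterior samples $\theta_1, \dots, \theta_M$ of the model
parameters (e.g.\ ensemble members) and a threshold $\tau$, all of which are
illustrative symbols rather than notation used elsewhere in this paper, is the
empirical variance of the predictions $f_{\theta_m}(x)$:
\[
  \bar{f}(x) = \frac{1}{M} \sum_{m=1}^{M} f_{\theta_m}(x),
  \qquad
  s(x) = \frac{1}{M} \sum_{m=1}^{M}
         \bigl\| f_{\theta_m}(x) - \bar{f}(x) \bigr\|_2^2 .
\]
A point $x$ is then flagged as novel if $s(x) > \tau$. The score is only
informative if the sampled predictors actually disagree on OOD inputs.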
% because you dunno prior dist of weights nor of function
% directly ... (dunno) \fy{why do you expect the other newer ones - think of Alex
% Immer type, still won't do well- would be good to cite this paper by andrew
% gordon wilson's lab ``what are BNN posteriors really like''}

When the labels are available for the training set, we can instead partially
% \fy{you're not really learning it fully but you may be able to distinguish
% large $\iddist$ vs. small}
\textbf{learn $\iddist$ using $y$}. For instance, one could use generative
modeling to estimate the set of $x$ for which $\iddist(x)>\alpha$ via
$\iddist(x|y)$ \citep{mahalanobis, gram_ood}.
% \fy{is that what open-set people are doing?}
Alternatively, given a loss and a function space, one may use the labels
indirectly, as in ERD, by exploiting properties of the approximated population
error that imply small or large $\iddist$.
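
To illustrate the generative route mentioned above, consider the following
sketch, which assumes class-conditional Gaussians with a shared covariance in
some feature space. The feature map $\phi$, the estimates $\hat{\mu}_c$ and
$\hat{\Sigma}$, and the threshold $\tau$ are illustrative assumptions, not
quantities defined in this paper. Modeling
$\hat{p}(x \mid y = c) = \mathcal{N}\bigl(\phi(x); \hat{\mu}_c, \hat{\Sigma}\bigr)$
for each ID class $c$ leads to the score
\[
  d(x) = \min_{c} \,
  \bigl(\phi(x) - \hat{\mu}_c\bigr)^{\top} \hat{\Sigma}^{-1}
  \bigl(\phi(x) - \hat{\mu}_c\bigr),
\]
and $x$ is flagged as novel if $d(x) > \tau$. Because the covariance is shared,
a large $d(x)$ is equivalent to a small $\max_c \hat{p}(x \mid y = c)$, so the
labels are used to approximate the region where $\iddist(x) > \alpha$ without
estimating $\iddist$ directly.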
% \fy{btw we used to explicitly write the assumption that we know an NN with
% high prediction accuracy}

% \fy{ this is a very abstract way to put it, probably don't have to say more
% but i think of it like that (see tex comment)}
%% Assuming NN already have good prediction accuracy \fy{we used to write that
%%   explicitly as assumption - is that still in paper?} we may approximate
%% $\EE_\iddist \loss (f(X),Y)$ using a hold-out validation set of the
%% training data and use the labels indirectly like in ERD: under some
%% smoothness assumptions on the function space, the $x$ for which you
%% can fit arbitrary label without hurting validation accuracy has to be
%% the ones with small $\iddist<\alpha$.  Instead of using assumptions on
%% $P_x$ these methods hence uses assumptions on the labeling function
%% which might be better known.


%%%%% Fanny old 
%% Instead, in this
%% work we \textbf{learn $\iddist$ using $y$}, the labels of the ID training set,
%% in a setting that resembles open-set recognition. These work well when the true
%% data generating model is generative, i.e. a mixture of Gaussians,
%% and hence making it sth like Gaussian Bayes type problem. 

%% either using generative models (which is notoriously difficult in high
%% dimensions) or, implicitly, through one-class classification or PU learning
%% (which tend to produce indistinguishable representations for inliers and
%% outliers when the ID classes are numerous and diverse).

%% Similarly, they can be viewed as samples $(x,y)$ where $\iddist(y)$  is small.
%% Imagine a classification case where classes are clustered and far apart (generative model).
%% Then $P(x)$ will be small for those in the corresponding clusters and hence $P(y|x)$ is high - hae
%% more like for small $P(x)$ the variance is bigger in the Gaussian process sense
%% and hence $P(y|x,D)$ is larger. 

%% predictive uncertainty can only make sense when we have a discriminative model.



%%%% Alex old:

%% We now briefly discuss the different surrogate objectives that methods in the literature
%% use in order to detect OOD samples and refer to the
%% Appendix~\ref{sec:appendix_related_work} for more details.

%% We loosely define OOD samples as all $x$ for which $\iddist(x) <\alpha$, for a
%% small constant $\alpha > 0$. Since the true marginal distribution is unknown, we
%% need to estimate its level sets. We can \textbf{learn $\iddist$} from unlabeled
%% ID data either using generative models (which is notoriously difficult in high
%% dimensions) or, implicitly, through one-class classification or PU learning
%% (which tend to produce indistinguishable representations for inliers and
%% outliers when the ID classes are numerous and diverse). Alternatively, in this
%% work we \textbf{learn $\iddist$ using $y$}, the labels of the ID training set,
%% in a setting that resembles open-set recognition. Finally, one can use
%% calibrated estimates of \textbf{predictive uncertainty} (e.g.\ Bayesian
%% approaches) for OOD detection, although they perform worse than other OOD
%% methods \citep{ood_ovadia}.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "iclr2021_conference"
%%% End:
