\section{Introduction}

Despite achieving strong in-distribution (ID) prediction performance, deep neural
networks (DNNs) often struggle with test samples that are
out-of-distribution (OOD), i.e.\ test inputs that are unlike the data seen
during training. In particular, DNNs often make incorrect predictions with high
confidence when new, unseen classes emerge over time (e.g.\ undiscovered bacteria
\citep{jieren}, new diseases \citep{Katsamenis20}).
Instead, 
%When some inputs come from previously unseen classes,
we would like to automatically \emph{detect} such novel samples and bring them
to the attention of human experts.

\begin{figure}[t]
  \begin{center}
%     \includegraphics[width=\columnwidth]{figures/practical_sketch.png}
    \includegraphics[width=\columnwidth]{figures/practical_sketch.pdf}
  \end{center}

%   \vspace{-0.3cm}
  \caption{
%     \small{
      Novelty detection is challenging since X-rays of novel diseases are
      remarkably similar to those of known conditions. The unlabeled batch of
      inference-time data can be used to adapt a semi-supervised novelty
      detection approach to emerging novel diseases.
%   }
}

  \label{fig:practical_sketch}
\end{figure}



Consider, for instance, a hospital with a severe shortage of qualified
personnel. To make up for the lack of doctors, the hospital would like to use an
automated system for real-time diagnosis from X-ray images (Task I) and a
novelty detection system, run at the end of each week, to detect
outbreaks of novel disease variants (Task II; see
Figure~\ref{fig:practical_sketch}). In particular, the detection algorithm can
be fine-tuned weekly with the unlabeled batch of data collected during
that week.
 
While the experts examine the flagged X-rays over the course of the following
week, the novelty detection model helps to collect more instances of the same
new condition and can request human review for these patients.
The human experts can
then label these images and add them to the labeled training set to update
both the diagnostic prediction and the novelty detection systems. This process
repeats each week and enables both the diagnostic and the novelty detection
models to adjust to newly emerging diseases.



\begin{figure*}[t]
  \centering

    \includegraphics[width=0.7\textwidth]{figures/setting_ensemble.pdf}

%     \vspace{-0.1cm}
    \caption{
%       \small{
        \textbf{Left:} Sketch of the SSND setting.
    \textbf{Middle and Right:} Novelty detection with a diverse ensemble.
% }
}

   \label{fig:setting_ensemble}
   \vspace{-0.5cm}
\end{figure*}

Note that, in this example, the novelties are a particular kind of
out-of-distribution samples with two properties.
First, several novel-class samples may appear in the unlabeled batch at the end
of a week, e.g.\ a contagious disease will cause several people in a small
area to be infected. This situation differs from settings in which outliers are
assumed to be singular, e.g.\ in anomaly detection problems. Second, the
novel-class samples share many features with the ID data and differ from the
known classes only in certain minute details. For instance, both ID and
OOD samples are frontal chest X-rays, with the OOD samples showing distinctive
signs of a pneumonia caused by a new virus. In what follows, we use the terms
\emph{novelty detection} and \emph{OOD samples} in reference to data with these
characteristics.

Automated diagnostic prediction systems (Task I) often already achieve
satisfactory performance \citep{Calli2021}. In contrast, novelty
detection (Task II) remains a challenging problem in these scenarios. Many
prior approaches can be used for semi-supervised novelty detection (SSND), where
a batch of unlabeled data that may contain OOD samples is available, as in
Figure~\ref{fig:practical_sketch}.\footnote{We use the same definition of SSND
  as the survey by \citet{Bulusu2020}, whereas some works use the term to refer
  to supervised \citep{Gornitz2013, Daniel2019, Ruff2020} or unsupervised ND
\citep{Song2017, ganomaly2018} according to our taxonomy in
Section~\ref{sec:setting}.} However, all of these methods fail to detect
novel-class data when used with complex models such as neural networks.

Despite their success on simple benchmarks like SVHN vs.\ CIFAR10,
state-of-the-art unsupervised OOD detection methods perform poorly on near OOD
data \citep{winkens2020}, where OOD inputs are similar to the training samples.
Furthermore, even though unlabeled data can benefit novelty detection
\citep{scott09}, existing SSND methods for deep neural networks \citep{Kiryo17,
guo20, yujie2020, mcd_ood} do not improve upon unsupervised methods on near OOD
data sets. Even methods that violate fundamental OOD detection assumptions by
using known test OOD data for hyperparameter tuning \citep{odin, mahalanobis,
mcd_ood} fail on challenging novelty detection tasks. Finally, large
pretrained models seem to solve near OOD detection \citep{fort2021}, but they
only work for very specific OOD data sets (see Section~\ref{sec:setting}
for details).

This situation naturally raises the following question:

\vspace{-0.35cm}
\begin{center}
  \emph{Can we improve semi-supervised novelty detection\\for neural
  networks?}
\end{center}
\vspace{-0.2cm}



In this paper, we introduce a new method that successfully leverages unlabeled
data to obtain diverse ensembles for novelty detection. Our
contributions are as follows:

\begin{itemize}[leftmargin=*]

  \item We propose to find Ensembles with Regularized Disagreement (ERD), that
    is, disagreement only on OOD data. Our algorithm produces ensembles just
    diverse enough to be used for novelty detection with a disagreement test
    statistic (Section~\ref{sec:method}).

  \item We prove that training with early stopping leads to regularized
    disagreement, for data that satisfies certain simplifying assumptions
    (Section~\ref{sec:earlystopping}).

  \item We show experimentally that $\method$ significantly outperforms existing
    methods on novelty detection tasks derived from standard image data sets, as
    well as on medical image benchmarks (Section~\ref{sec:experiments}).

\end{itemize}
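
As a rough illustration of what a disagreement test statistic could look like
(the notation below is an assumption made only for this sketch; the formal
treatment is deferred to Section~\ref{sec:method}), consider an ensemble of $K$
models $f_1, \dots, f_K$ and a dissimilarity measure $\rho$ between two model
outputs, e.g.\ the total variation distance between predicted class
probabilities. One possible statistic averages the pairwise dissimilarities,
\[
  T(x) \;=\; \frac{2}{K(K-1)} \sum_{1 \le i < j \le K}
  \rho\bigl(f_i(x), f_j(x)\bigr),
\]
and a test input $x$ would be flagged as novel whenever $T(x)$ exceeds a
threshold $t_0$, chosen for instance to meet a target false positive rate on
held-out ID data. With regularized disagreement, $T(x)$ should stay close to
zero on ID inputs and be large only on OOD inputs.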



%%% Local Variables:
%%% mode: latex
%%% TeX-master: "iclr2021_conference"
%%% End:
