\section{Conclusion}
\label{sec:conclusion}

% In this section, we discuss the possible limitations of using $\method$ in practice before summarizing the advantages compared to other methods and outlining potential avenues for future work.

% \paragraph{Discussion on potential limitations of $\method$ and SSND.} First, as
% mentioned in Section~\ref{sec:earlystopping}, $\method$ assumes that OOD samples
% are sufficiently represented in the unlabeled set. Even though this may not be
% realistic for anomaly detection where outliers are rare, it may very well be
% feasible when samples from a novel class (such as a rapidly spreading disease)
% appear in large numbers at inference time.  We empirically investigate the
% impact on the performance of $\method$ of the size and the ratio of OOD samples
% in the unlabeled set ($\frac{|\targetoodset|}{|\targetidset| +
% |\targetoodset|}$).
% %% In addition, we also vary the ratio of OOD samples in the
% %% unlabeled set, i.e.\ $\frac{|\targetoodset|}{|\targetidset| +
% %%   |\targetoodset|}$.
% We find that there is a broad spectrum of values for which $\method$ maintains a
% good performance, as indicated in Figure~\ref{fig:vary_target_main} (see also
% Appendix~\ref{sec:appendix_vary_ood_ratio}). Moreover, in
% Appendix~\ref{sec:appendix_learning_curves} and~\ref{sec:appendix_score_curves}
% we provide insights as to how near OOD data affects the performance of
% $\method$.
% 
% Another limitation concerns the general applicability of SSND: the OOD data in
% the unlabeled set used for fine-tuning needs to match the OOD data at test time.
% Moreover, the SSND setting is not tailored to online (real-time) OOD detection.
% In the following we argue using an example that i) it may be inherently valuable
% to predict the OOD samples in the unlabeled set itself; \fy{to avoid confusion
% with transductive, perhaps focus on delay is ok? i.e. addressing the online
% comment and the other addresses the different OOD comment?} and ii) $\method$
% allows for a different test OOD distribution, at the cost of a slight delay. 
% 
% Consider, for instance, a medical center that uses an automated system for
% real-time diagnosis and an offline system which runs at the end of each week,
% for novelty detection. All the X-rays collected during the week
% % together with all previous unlabeled data
% constitute the unlabeled set $\targetset$ that the SSND method may then use for
% training. If a quickly spreading novel disease circulates the patient
% population, the detection model can identify the OOD samples that are then shown to
% the scarcely available experts.  While the experts are examining the peculiar X-rays
% in the course of the next week, the model helps to collect more instances of the
% same new condition and can already encourage clinicians to practice extra caution when
% diagnosing these patients. Since the novelty detection algorithm is run every week,
% new diseases are identified with a delay of at most a week -- the time it
% takes to collect an unlabeled set. Since new diseases emerge rarely and the
% benefits of even delayed identification greatly outweigh the waiting time, SSND
% approaches are particularly suitable to practical scenarios like this.
% % when new diseases arrive (which does not happen often), they will only be detected
% % with a delay of a week.  \fy{refactor sentence}.

%% identifies any new diseases, then it
%% can also be used without retraining to flag new patients suffering
%% from the novel disease. We note that, in applications where accurate
%% novelty detection is critical, the cost of a delayed detection
%% (e.g.\ the time needed to collect a batch of unlabeled X-rays) is
%% justified by the substantially better performance of algorithms that
%% leverage unlabeled data.


%% We now discuss some shortcomings of existing OOD detection approaches closely
%% related to ours and indicate how our method attempts to address them.  Firstly,
%% vanilla ensembles use only the stochasticity of the training process and the
%% random initialization to obtain diverse models, but this often leads to similar
%% classifiers, that predict the same incorrect label on OOD data \cite{hein}.
%% Secondly, in the absence of proper regularization, optimizing the MCD objective
%% leads to models that agree to a similar extent on both ID and OOD data so that
%% one cannot distinguish them from one another (as indicated by low AUROC scores).
%% Furthermore, nnPU does not exploit all the signal in the training set and
%% discards the labels of the ID data.

%% $\method$ successfully diversifies an ensemble on OOD data by using the
%% unlabeled set and without requiring additional information about the test
%% distribution (e.g.\ unlike nnPU which requires the true OOD ratio). We identify
%% the key reasons behind the good performance of our approach to be as follows:
%% 1)~utilizing the labels of the ID training data and the complexity of deep
%% neural networks to diversify model outputs on OOD data; 2)~choosing an
%% appropriate disagreement score that draws on ensemble diversity; 3)~employing
%% early stopping regularization to prevent diversity on ID inputs.

% \paragraph{Summary and future work.}
In summary, we propose an SSND procedure that effectively exploits unlabeled
data to generate an ensemble with \emph{regularized} disagreement, which
achieves strong novelty detection performance. Unlike many of the related
methods summarized in Table~\ref{table:taxonomy}, our SSND method does not
require labeled OOD data during training.
%% We would like to stress once
%% again that a significant advantage of the SSND setting is that it does not
%% require any labeled or oracle OOD data during fine-tuning, unlike other
%% related works summarized in Table~\ref{table:taxonomy}. 

We leave for future work a thorough investigation of how the labeling scheme
of the unlabeled set affects the sample complexity of the method, as well as
an analysis of the trade-off governed by the complexity of the model class.

% \section{Conclusion}
% \label{sec:conclusion}
% Reliable OOD detection is essential in order to deploy classification systems in
% safety-critical environments. We propose a procedure that succeeds in generating an ensemble \emph{regularized} disagreement, that is models that only disagree on OOD data.  We successfully leverage
% unlabeled data by fine-tuning the models in the ensemble to fit diverse classes on the unlabeled test. Early stopping then enables us to It outperforms
% state-of-the-art methods that also have access to a mixture of ID and unknown
% OOD samples, but also approaches that use known OOD data for training.
% As future
% work, we propose an investigation into the influence of the labeling scheme of
% the unlabeled set on the sample complexity of the method, as well as an analysis
% of the trade-off governed by the complexity of the model class of the
% classifiers.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "iclr2021_conference"
%%% End:
