% \vspace{-0.2cm}
\section{Experimental results}
\label{sec:experiments}

In this section we evaluate the novelty detection performance of $\method$ with
deep neural networks on several image data sets. On difficult near OOD data
sets, we find that our approach outperforms all baselines, including both SSND
methods and methods that operate in other, sometimes more favorable, settings.
In addition, we discuss some of the trade-offs that impact $\method$'s
performance.

% \vspace{-0.2cm}
\subsection{Data sets}

%% We report results on near and far OOD detection scenarios using standard image data
%% sets and a recent OOD detection benchmark for medical images
%% \citep{Cao2020} (see Appendix~\ref{sec:appendix_datasets} for details).

Our
experiments focus on novel-class detection scenarios where the ID and OOD data
share many similar features and only differ in a few characteristics.
% \fy{have we resolved this novel class vs.  novelty thing somewhere?}
We use standard image data sets (e.g.\ CIFAR10/CIFAR100) and consider half of
the classes as ID, and the other half as novel. We also assess $\method$'s
performance on a medical image benchmark \citep{Cao2020}, where near OOD data
consists of novel unseen diseases (e.g.\ X-rays of the same body part from
patients with different conditions; see Appendix~\ref{sec:appendix_datasets} for
details). For completeness, we also include far OOD data sets (e.g.\
CIFAR10/CIFAR100 vs SVHN).
% since these are the settings considered in the majority of the literature and
% on which most baselines perform well.

%% \paragraph{Easy/Far OOD data.} ID and OOD samples come from strikingly different
%% data sets (e.g.\ CIFAR10 or CIFAR100 as ID and SVHN as OOD). These are the
%% settings considered in the majority of the literature and on which most
%% baselines perform well.

%% \paragraph{Hard/Near OOD data.} The OOD data consists of ``novel'' classes that
%% resemble the ID samples. For the standard image data sets we consider half of
%% the classes as ID, and the other half as OOD. For the medical benchmarks near
%% OOD data consists of unseen diseases. The similarities between the ID and the
%% OOD classes make these settings significantly more challenging.

%% Apart from using these canonical data sets, we also compare the performance of
%% our method on more realistic data, namely a recently proposed OOD detection
%% benchmark for medical imaging \cite{Cao2020}. This benchmark contains a suite of
%% data sets that cover three categories of difficulty, as detailed in
%% Appendix~\ref{sec:appendix_medical}.

% \vspace{-0.5cm} Appendix \ref{sec:appendix_hardness} provides more insight on
% OOD detection hardness, while Appendix \ref{sec:appendix_datasets} presents
% examples of images for the various settings. We are not too interested in the
% practical scenario of covariate shift \cite{Shimodaira2000} when the
% distributions are so close that domain adaptation techniques could perform
% well. Having said that, we show in
% Appendix~\ref{sec:appendix_more_experiments} that our method successfully
% identifies samples coming from the target distribution, even for mild shifts,
% providing further evidence that $\method$ is particularly well suited for the
% most difficult of OOD detection tasks.

For all scenarios, we use a labeled training set (e.g.\ 40K samples for
CIFAR10), a validation set with ID samples (e.g.\ 10K samples for CIFAR10) and
an unlabeled test set where half of the samples are ID and the other half are
OOD (e.g.\ 5K ID samples and 5K OOD samples for CIFAR10 vs SVHN). For
evaluation, we use a holdout set containing ID and OOD samples in the same
proportions as the unlabeled set. Moreover, in
Appendix~\ref{sec:appendix_small_test_set} we present results obtained with a
smaller unlabeled set of only 1K samples.


\begin{table*}[t]
\scriptsize
\centering

\caption{
%   \small{
    AUROC and TNR@95 for $\method$ and various baselines (we
    \bestnonreto{highlight} the best method for each data set). Numbers in square
    brackets indicate the ID/OOD classes. 
%     We highlight \bestreto{$\method$} and the \bestnonreto{best baseline}. 
    Asterisks mark methods proposed in this paper. Mahal, nnPU and MCD
($^\dagger$) use oracle information about the OOD data.
Repeated runs of $\method$ show a small variance $\sigma^2 < 0.01$ in
the detection metrics.
% }
}
% \vspace{-0.2cm}

\label{table:main_results}

\input{tables/hand_curated_main_results_table}
% \input{tables/main_results_table}

% \vspace{-0.4cm}
\end{table*}


\subsection{Baselines}

We compare our method against a wide range of
baselines that are applicable in the SSND setting.
%% that require different access to OOD data for training, as indicated
%% in Table~\ref{table:taxonomy}.


\paragraph{Semi-supervised novelty detection.} We primarily compare $\method$ to
SSND approaches that are designed to incorporate a small set of unlabeled ID and
novel samples.
%% is available.
%% that can be used in the semi-supervised setting for deep
%% neural networks, in which a small set of unlabeled ID and OOD samples
%% is available.

The \emph{MCD} method \citep{mcd_ood} trains an ensemble of two classifiers such
that one model yields high-entropy and the other low-entropy predictive
distributions on the unlabeled samples. Furthermore, \emph{nnPU} \citep{Kiryo17}
considers a binary classification setting in which the labeled data comes from
one class (i.e.\ ID samples, in our case), while the unlabeled set contains a
mixture of samples from both classes. Notably, both methods require oracle
information that is usually unavailable in the regular SSND setting: MCD uses
test OOD data for hyperparameter tuning, while nnPU requires the ratio of OOD
samples in the unlabeled set.

In addition to these baselines, we also propose natural extensions of two
existing methods to the SSND setting. First, we present a version of the
Mahalanobis approach (\emph{Mahal-U}) that is calibrated using the unlabeled
set instead of oracle OOD data. Second, since nnPU requires access to
the OOD ratio of the unlabeled set, we also consider a less burdensome
alternative: a \emph{binary classifier} trained to separate the training data
from the unlabeled set and regularized with early stopping, like our method.

\paragraph{Unsupervised novelty detection (UND).} Naturally, one may ignore the
unlabeled data and use unsupervised approaches. 
% Although the comparison with UND methods is not entirely fair, we include it
% in our analysis for the sake of completeness.
The current SOTA UND method on the usual benchmarks is the \emph{Gram method}
\citep{gram_ood}. Other notable UND approaches include \emph{vanilla ensembles}
\citep{balaji}, deep generative models (which tend to give undesirable results
for OOD detection \citep{Kirichenko20}), or various Bayesian approaches (which
are often poorly calibrated on OOD data \citep{ood_ovadia}).

Preliminary analyses revealed that generative models and methods trained with a
contrastive loss \citep{winkens2020} or with one-class classification
\citep{sohn2021} perform poorly on near OOD data sets (see
Appendix~\ref{sec:appendix_cifar10_cifar100} for a comparison; we use numbers
reported by the authors for works where we could not replicate their results).

%% \fy{not relevant here}
%% Despite using less information
%% than what is available in the SSND setting, some UND methods that use \emph{no
%% OOD} data for training tend to outperform previously proposed SSND approaches.

\paragraph{Other methods.} We also compare with \emph{Outlier Exposure}
\citep{outlier_exposure} and \emph{Deep Prior Networks (DPN)} \citep{dpn}, which use
TinyImages as known outliers during training, irrespective of the OOD set used
for evaluation. In contrast, the \emph{Mahalanobis} baseline \citep{mahalanobis}
is tuned on samples from the same OOD distribution used for evaluation.
Finally, we also consider large transformer models pretrained on ImageNet21k and
fine-tuned on the ID training set \citep{fort2021}.



% \vspace{0.1cm}
\subsection{Implementation details}
\label{sec:erd_eval}

\paragraph{Baseline hyperparameters.}
% For the results in this section we focus on the transductive setting, i.e.\
% OOD detection is performed on the same unlabeled set that is used to tune the
% method. In Appendix~\ref{sec:appendix_semi_supervised} we consider the
% semi-supervised setting with a holdout test set and we show that the AUROC of
% our method is within $0.01$ from the one reported in
% Table~\ref{table:main_results}.
Unless stated otherwise, we use the default hyperparameters suggested by the
authors of each baseline for the respective ID data set (see
Appendix~\ref{sec:appendix_experiments} for more details). For the binary
classifier, nnPU, ViT, and vanilla ensembles, we instead choose the
hyperparameters that minimize the loss on an ID validation set.
% We pick $K=5$ for the size of Vanilla Ensembles and note that with the
% exception of Vanilla Ensembles, nnPU and the binary classifier, all baselines
% use WideResNet-28-10 architectures.  We defer the details regarding training
% the models to Appendix~\ref{sec:appendix_experiments}.

\begin{figure*}[t]
  \begin{subfigure}[t]{0.49\textwidth}
    \centering
    \includegraphics[width=0.8\textwidth]{figures/avg_medical_ood_main_text.png}
    \caption{Novelty detection performance on medical data}
    \label{fig:avg_medical_ood_main_text}
  \end{subfigure}
  \hfill
  \begin{subfigure}[t]{0.49\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/heatmap_cifar10:0,1,2,3,4_vs_cifar10:5,6,7,8,9_ensemble3_holdout.png}
    \caption{Effect of OOD proportion on detection}
    \label{fig:vary_target_main}
  \end{subfigure}

% \vspace{-0.1cm}
\caption{
%   \small{
    \textbf{Left:} AUROC averaged over all scenarios
    in the medical novelty detection benchmark. The values for the baselines are
    computed using the code from \citet{Cao2020}. \textbf{Right:} The AUROC of a
    3-model $\method$ ensemble as the number and proportion of ID (CIFAR10[0:4])
and OOD (CIFAR10[5:9]) samples in the unlabeled set are varied (see also
Appendix~\ref{sec:appendix_vary_ood_ratio}).
% }
}
\vspace{-0.2cm}

\end{figure*}



\paragraph{$\method$ details.}\footnote{Our code is publicly available at
\href{https://github.com/ericpts/ERD}{https://github.com/ericpts/ERD}.}
% \fy{maybe refer to steps in algo 1 to be bulletproof clear}
We follow the procedure in Algorithm~\ref{algo:reto_training} to fine-tune each
model in the $\method$ ensemble starting from weights that are pretrained on the
labeled ID set $\sourceset$.\footnote{In the appendix we also train the models
  from random initializations, i.e.\ $\method$++, and obtain better novelty
detection at the cost of more training iterations.}
% \fy{Be more direct (merge with commented out old version):}
Unless otherwise specified, we train $K=3$ ResNet20 networks \citep{He2015} using
3 randomly chosen class labels for $\labeledtarget$, and note that even ensembles
of two models produce good results (see
Appendix~\ref{sec:appendix_ensemble_size}). We stress that, wherever applicable,
our choices put $\method$ at a disadvantage relative to the baselines: e.g.\
vanilla ensembles use $K=5$, and for most of the other approaches we use the
larger WideResNet-28-10.
% For all experiments \fy{even medical?}, each model in the ensemble is based on
% a ResNet-20 \citep{He2015}.  \fy{1.  Reason for choosing those (e.g. $K=5$
% doesn't do better) 2. Note that wherever applicable this disadvantages us for
% comparison with some related methods, e.g.  blabla} \fy{didn't get the ``when
% fair comparison is not possible'' phrase, see comment in comments}
We select the early stopping time and other standard hyperparameters so as to
maximize validation accuracy.

%% When a fair comparison with all the
%% baselines is not possible, we err on the side of disadvantaging our
%% method. Hence, we train $\method$ ensembles of size $K=3$ (smaller
%% than $K=5$ used for Vanilla Ensembles) and note that even ensembles of
%% two models produce good results (see
%% Appendix~\ref{sec:appendix_ensemble_size}). \fy{not entirely sure what
%%   you mean by fair comparison - i know it has to do with reviewer but
%%   don't get the point here. what does ``not possible'' mean? due to
%%   data availability or computationally or what? like - why can't we
%%   use wideresnet} Moreover, we use ResNet20 models \citep{He2015},
%% which are significantly smaller than the WideResNet-28-10 used by most
  %% baselines.
  
%% We choose the arbitrary label assigned to the unlabeled set at random, without
%% replacement.
% (see Appendix~\ref{sec:appendix_ensemble_size} for a discussion on the impact
% of the choice of arbitrary label). \fy{can maybe unbracket and merge these two
% brackets}

% Analogous to the parameter choices for the baselines, we pick hyperparameters
% such as the early stopping time and other standard training parameters, to
% maximize validation accuracy.
% \fy{why can't you put all in one sentence?}
  
%% For each model in the ensemble we perform post-hoc early stopping: we train for
%% 10 epochs and select the iteration with the lowest validation loss. The other
%% hyperparameters for training are chosen to maximize validation accuracy on the
%% ID data. \fy{what reads weird for me here is that early stopping time also uses validation acc ..., maybe rephrasing helps it seem less weird}

\paragraph{Evaluation.} As in standard hypothesis testing, choosing different
thresholds for rejecting the null hypothesis leads to different false positive
and true positive rates (FPR and TPR, respectively). The ROC curve traces the
FPR and TPR over all possible threshold values, and the area under the curve
(AUROC; larger values are better) captures the performance of a statistical test
without having to select a specific threshold. In addition, we report the
TNR at a TPR of 95\% (TNR@95; larger values are better). These metrics evaluate
the quality of an outlier score without requiring a rejection threshold to be
fixed in advance. In practice, selecting a threshold is straightforward: for
instance, one can choose it so as to achieve a desired FPR, which can be
estimated using a validation set of ID samples.\footnote{Alternatively, the
  criterion of \citet{Liu18} selects the threshold in a way that is tailored
  specifically to the SSND setting. This method uses the unlabeled set and the
known ID data to estimate the distribution of outlier scores for OOD points.}
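
To make these metrics concrete, the following is a minimal formalization under
conventions that we assume here purely for illustration: $s(x)$ denotes a
generic outlier score, larger values of $s(x)$ indicate novelty, OOD samples
are treated as the positive class, and $t$ is the rejection threshold. Then
\[
  \mathrm{TPR}(t) = \Pr\left[ s(x) > t \mid x \mbox{ is OOD} \right],
  \qquad
  \mathrm{FPR}(t) = \Pr\left[ s(x) > t \mid x \mbox{ is ID} \right],
\]
and, writing $t_{95}$ for a threshold with $\mathrm{TPR}(t_{95}) = 0.95$,
\[
  \mathrm{AUROC} = \int_0^1 \mathrm{TPR} \; \mathrm{d}(\mathrm{FPR}),
  \qquad
  \mathrm{TNR@95} = 1 - \mathrm{FPR}(t_{95}).
\]
Under the same assumptions, targeting an FPR of $\alpha$ amounts to setting $t$
to the empirical $(1-\alpha)$-quantile of the scores $s(x)$ computed on the ID
validation set.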
% For evaluation, we use two metrics that are common in the OOD detection
% literature: the area under the ROC curve (AUROC; larger values are better) and
% the true negative rate at a true positive rate of 95\% (TNR@95; larger values
% are better).

\paragraph{Computation cost.}
% \fy{somehow find absolute minute times not that meaningful, ideally more
% comparative - maybe computational cost is better title. also maybe put it in
% context of the example again} \fy{would say sth like:}
Fine-tuning a two-model ensemble already suffices for good performance with
$\method$ (see Appendix~\ref{sec:appendix_ensemble_size}). For instance, in
applications like the one in Figure~\ref{fig:practical_sketch}, $\method$
fine-tuning introduces little overhead and works well even with scarce resources
(e.g.\ it takes around 5 minutes on 2 GPUs for the settings in
Table~\ref{table:main_results}).
% \fy{here maybe have to discuss in case its hard to fit - then we actually stop
% using a better rule and hence save?} \at{see comment above re this}
In contrast, other ensemble diversification methods require training different
models for each hyperparameter choice and have training losses that cannot be
easily parallelized (e.g.\ \citet{mcd_ood}). Moreover, the only other approach
that achieves performance comparable to our method on \emph{some} near OOD data
uses large transformer models pretrained on a large and conveniently chosen
data set \citep{fort2021}.
% \at{maybe say that it takes 1hr to tune}


% \vspace{-0.2cm}
\subsection{Main results}
% \vspace{-0.1cm}

We summarize the main empirical results
in Table~\ref{table:main_results}. While most methods achieve near-perfect
detection for far OOD data, $\method$ has a clear edge over the baselines for
novel-class detection within the same data set, even compared to methods
($^\dagger$) that use oracle OOD information. For completeness, we present in
Appendix~\ref{sec:appendix_cifar10_cifar100} a comparison with additional
related works. These methods either show unsatisfactory performance on near OOD
tasks or seem to work well only on certain specific data sets.
% \fy{given previous sentence maybe don't have to bash Vit exactly here? seems
% redundant with what comes later} In particular, we note that large pretrained
% transformers \citep{fort2021} detect novel-class samples well when ID and OOD
% classes (e.g.\ CIFAR10/CIFAR100) are represented in the pretraining data
% (i.e.\ Imagenet21k), but perform rather poorly \fy{in all other problems ID vs
% OOD scenarios}.
We elaborate on the potential causes of failure for these works in
Section~\ref{sec:setting}.

% We conjecture that the pretraining data helps to learn features that separate
% well CIFAR10[0:4] from CIFAR10[5:9], but the model projects all SVHN digits to a
% concentrated region in feature space, making them difficult to distinguish with
% only weak supervision from the ID labels.

%% the difficult novelty detection scenarios (bottom part), $\method$ has a clear
%% edge over the baselines, even when they are calibrated on \emph{oracle OOD}
%% data, or when they use the true OOD ratio of the unlabeled set, e.g.\ nnPU.
% \fy{next sentence doesn't add anything anymore} The substantial gap between
% $\method$ and other approaches, both in average AUROC and average TNR@95,
% indicates that our method lends itself well to practical situations when
% accurate OOD detection is critical.


% Moreover, we show in Appendix~\ref{sec:appendix_more_experiments} that our
% method successfully identifies as OOD even samples with covariate shift
% \cite{Shimodaira2000}, which are so similar to the original data that domain
% adaptation techniques can perform well. In these scenarios, flagging all the
% samples from the target distribution as OOD is undesirable,\footnote{Ideally,
% we want to use domain adaptation whenever possible, and only detect
% problematic samples.} but it provides further evidence that $\method$ is
% particularly well-suited for the most difficult of OOD detection tasks.
% \at{maybe cut from here}

For the medical novelty detection benchmark we show in
Figure~\ref{fig:avg_medical_ood_main_text} the average AUROC achieved by some
representative baselines taken from \citet{Cao2020}. Compared to the best
baseline, our method improves the average AUROC from $0.85$ to $0.91$.
We refer the reader to \citet{Cao2020} for precise details on the methods.
Appendix~\ref{sec:appendix_medical} contains more results, as well as additional baselines.

% \fy{here i realize that this mixing of OOD and novel class might be confusing
% in particular the way we led with the example makes it less normal OOD
% problem, but really (SS)ND - should have a discussion on OOD vs. novelty/novel
% class somewhere but i think it would be cleaner (for people not to fall back
% to traditional OOD scenarios) to  stick with ND?}


% \vspace{-0.2cm}
\subsection{Ablation studies and limitations}
\label{sec:ablations}
% \vspace{-0.1cm}


We also perform extensive experiments to understand the importance of specific
design choices and hyperparameters, and refer the reader to the appendix for
details.

\paragraph{Relaxing assumptions on OOD samples.} In
Table~\ref{table:main_results} we evaluate our approach on a holdout test set
that is drawn from the same distribution as the unlabeled set $\targetset$ used
for fine-tuning. However, the experiments in
Appendix~\ref{sec:appendix_different_ood} show that $\method$ continues to
detect novel samples well even when the test set and $\targetset$ come
from different distributions (e.g.\ when the novel-class data in the test set
is additionally corrupted).
% Finally, in Appendix~\ref{sec:appendix_transductive_results} we show that
% $\method$ also successfully identifies the novel samples from the unlabeled
% set used for fine-tuning, in a transductive setting.  \at{TODOTODO} \fy{not so
% sure where to put this, cause datasets weren't mentioned earlier, nobody
% expects it here}
Further, even though our main focus is novel-class detection, our experiments
(Appendix~\ref{sec:appendix_cov_shift}) indicate that $\method$ can also
successfully identify near OOD samples that suffer from only mild covariate
shift compared to the ID data (e.g.\ CIFAR10 vs corrupted CIFAR10 \citep{cifar_c}
or CIFAR10v2 \citep{recht}).  Finally,
Appendix~\ref{sec:appendix_transductive_results} shows that $\method$ ensembles
also perform well in a transductive setting \citep{scott08}, where the test set
coincides with $\targetset$.

\paragraph{Relaxing the assumptions of Proposition~\ref{proposition_informal}.}
Our theoretical results require that the ID classes in the training set and
the novel classes in $\targetset$ have similar cardinality. Our empirical
analysis shows that this condition is stronger than necessary: in all
experimental settings we have significantly fewer OOD than ID training points.
% 
% is clusterable in balanced clusters. However, we show in our experiments that
% the intuition outlined in Proposition~\ref{proposition_informal} also holds
% true when the number of novel-class samples is significantly smaller, as is
% the case in real-life applications.
We further investigate the impact of the size of the unlabeled set and of the
ratio of novel samples in it ($\frac{|\targetoodset|}{|\targetidset| +
|\targetoodset|}$) and find that $\method$ maintains good performance
for a broad range of ratios (Figure~\ref{fig:vary_target_main}).
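
As a concrete instance of this ratio, consider the unlabeled set sizes reported
earlier for CIFAR10 vs SVHN (5K ID samples and 5K OOD samples); the arithmetic
below is only an illustration of the definition, not an additional result:
\[
  \frac{|\targetoodset|}{|\targetidset| + |\targetoodset|}
  = \frac{5000}{5000 + 5000} = 0.5 .
\]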
% However indeed OOD shouldn't be too small, otherwise it may be ignored and
% artificially labeled ID starts fitting before OOD.



% To further illustrate the remarkable advantages of $\method$ ensembles, we
% show proof-of-concept experiments which indicate that our method can be used
% beyond novel-class detection (Appendix~\ref{sec:appendix_cov_shift}). In
% particular, it successfully identifies OOD samples with mild covariate shift,
% strikingly similar to ID data (e.g.\ corrupted CIFAR10 \citep{cifar_c},
% CIFAR10v2 \citep{recht}).

% used beyond novel class detection: \fy{maybe here something like
% $P_x < \alpha$?} it can even successfully identify OOD samples with mild
% distribution shift (e.g.\ corrupted CIFAR10 \citep{cifar_c}, CIFAR10v2
% \citep{cifar10v2}, ObjectNet \citep{Barbu2019}), which provides further evidence
% that $\method$ is well-suited for the most difficult of OOD detection tasks.


%% In addition, we also vary the ratio of OOD samples in the
%% unlabeled set, i.e.\ $\frac{|\targetoodset|}{|\targetidset| +
%%   |\targetoodset|}$.


% \paragraph{Choice of disagreement metric and importance of regularized
% ensembles}
% 
% \fy{merge the below} First we show that both ERD and the disagreement metric
% are crucial for good OOD detection (see Table~\ref{table:score_comparison} in
% Appendix~\ref{sec:appendix_statistic}).  The substitution of either with a
% more common choice \fy{this shouts for MCD with the same...? motivation to
% ablate disagreement metric? - i find this not so important anymore after the
% very very detailed explanation earlier ...?i would maybe just add it in the
% discussion there}

% \fy{in some way these are all ablations so I got confused why here you have 1,2 that to me sounds quite different}

% both the training procedure \fy{which part? as such, sentence has no meaning}
% for $\method$ ensembles and the disagreement score introduced in
% Section~\ref{sec:disagreement} are crucial for good OOD detection (see
% Table~\ref{table:score_comparison} in Appendix~\ref{sec:appendix_statistic}).

\begin{table*}[t]
  \scriptsize
  \centering

  \caption{Taxonomy of novelty detection methods, categorized according to data
    availability (\textbf{horizontal axis}) and probabilistic perspective
    (\textbf{vertical axis}). We
  \ensemble{highlight} the ensemble-based methods.}
    \label{table:taxonomy}

  \input{tables/taxonomy.tex}
%   \vspace{-0.4cm}

\end{table*}



\paragraph{Sensitivity to hyperparameter choices.} $\method$
ensembles are particularly robust to changes in hyperparameters such as batch
size or learning rate (Appendix~\ref{sec:appendix_lr_bs}), and to the choice of
the arbitrary label assigned to the unlabeled set
(Appendix~\ref{sec:appendix_ensemble_size}). Further, $\method$
ensembles with as few as two models already show strong novelty detection
performance; we refer to Appendix~\ref{sec:appendix_ensemble_size} for
experiments with larger ensemble sizes.
% We refer to Appendix~\ref{sec:appendix_different_arch} for results with
% different architectures and to Appendix~\ref{sec:appendix_ensemble_size} for a
% discussion on the impact of the ensemble size and the choice of the arbitrary
% label.
Moreover, the performance of $\method$ improves with larger neural networks
(Appendix~\ref{sec:appendix_different_arch}), which suggests that $\method$ can
benefit from future advances in architecture design.

\paragraph{Choice of disagreement score.}
Table~\ref{table:score_comparison} in Appendix~\ref{sec:appendix_statistic}
shows that the training procedure alone (Algorithm~\ref{algo:reto_training})
does not suffice for good novelty detection. For optimal results, $\method$
ensembles need to be combined with a disagreement-based score like the one
introduced in Section~\ref{sec:disagreement}.
% Finally, in Appendix~\ref{sec:appendix_learning_curves}
% and~\ref{sec:appendix_score_curves}
Finally, we show how the distribution of the disagreement score changes during
training for $\method$ (Appendix~\ref{sec:appendix_score_curves}) and explain
why regularizing disagreement is more challenging for near OOD data, compared to
easier, far OOD settings (Appendix~\ref{sec:appendix_learning_curves}).

\paragraph{Limitations.} Despite its advantages, $\method$, like all prior SSND
methods, is not a good fit for online (real-time) novelty detection tasks.
Moreover, $\method$ ensembles are not tailored to anomaly detection, where
outliers are particularly rare, since the unlabeled set should contain at least
a small number of samples from the novel classes (see
Figure~\ref{fig:vary_target_main} and
Appendix~\ref{sec:appendix_vary_ood_ratio}). However, $\method$ ensembles are
an ideal candidate for applications that require highly accurate, offline
novelty detection, like the one illustrated in
Figure~\ref{fig:practical_sketch}.
% Our method shows significantly improved detection performance in the SSND
% setting, which is relevant for practical applications like the one illustrated
% in Figure~\ref{fig:practical_sketch}.  However, caution is in order when using
% $\method$ beyond its intended scope.
% \fy{honestly this point is kinda strange - its more challenging for everybody, no? how is that part of limitation of ERD? i didn't really get whether this next one is positive or negative}
% Finally, despite $\method$ achieving substantially better novelty detection
% compared to the baselines, regularizing disagreement on novel samples is more
% challenging when ID and OOD data are similar, as we explain in
% Appendix~\ref{sec:appendix_learning_curves}
% and~\ref{sec:appendix_score_curves}.

% Our method solves the SSND setting that is relevant in practice
% like in the leading example, there are two notes of caution ... 
% \fy{however caution is in order when trying
% to use it beyond what it's made for}:
% Like all SSND methods, our method is most
% suitable for situations when inference-time data is available in
% batches (see Figure~\ref{fig:practical_sketch}) and not tailored to online (real-time) OOD detection.
% % First, as mentioned in Section~\ref{sec:earlystopping}, $\method$ assumes that
% As seen in Figure~\ref{} albeit being not as restrictive as
% theory suggests, we require OOD
% samples to be sufficiently represented in the unlabeled set. This may not
% be realistic for anomaly detection where outliers are very rare.
% 
% \at{add this in here} However indeed OOD shouldn't be too
% small, otherwise it may be ignored and artificially labeled ID starts fitting
% before OOD.
% 
% \at{say that it can be applied on different OOD data (point to appendix with
% CIFAR-C stuff)}
% 
% \fy{either this way or end on a high note ... not sure}
% 
% 
% \at{move this in next par}
% Moreover, in Appendix~\ref{sec:appendix_learning_curves}
% and~\ref{sec:appendix_score_curves}, we provide insights as to how near OOD data
% affects the performance of $\method$. \fy{don't know what to expect when i read this}.

% In addition, in Appendix~\ref{sec:appendix_cost} we show the dependence of the
% OOD detection performance on the ensemble size for $\method$ and argue that
% the AUROC scales more favorably with the number of models, compared to vanilla
% ensembles.

% The evaluation for the scenarios presented in Table~\ref{table:main_results} is
% performed on the same test set that was used for training, as usual in
% transductive learning. In addition to that, the OOD detection performance of
% $\method$ extrapolates well to unseen samples from the same
% distribution.\footnote{This setting is similar for instance to the one in the
% Mahalanobis baseline, which assumes oracle knowledge of the OOD distribution at
% training time.} In order to show this, we run experiments in which we
% compute the AUROC on a hold-out test set drawn from the same ID and OOD
% data sets as the ones used during training. The AUROC on the hold-out test
% set is within 0.01 from the one calculated on the test set observed during
% training (see Appendix~\ref{sec:appendix_holdout} for more details).

% In the cases when either the size of the test set or the test OOD ratio is
% small, the OOD detection performance deteriorates to the point where it is
% comparable to vanilla ensembles, as shown in Figure~\ref{fig:vary_target_main}
% where we report the gap in AUROC between $\method$ and a vanilla ensemble.
% This loss in efficacy can be mitigated by either splitting the test set in
% smaller batches, or by using a different labeling scheme for the test set, the
% details of which we leave as future work.




%%% Local Variables:
%%% mode: latex
%%% TeX-master: "iclr2021_conference"
%%% End:
