% \documentclass{uai2024}
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{graphicx} % Required for inserting images
\usepackage{authblk}
\usepackage{amsmath}
\usepackage[inline]{enumitem}
\usepackage{subfig}
\usepackage{array}
\usepackage{makecell}
\usepackage{amssymb}
\usepackage{xcolor}
\usepackage{color,soul}
\usepackage{algorithm,algpseudocode}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{lipsum}
\usepackage{tabularray}
\usepackage{caption}
\usepackage{mathtools}

\newcommand\crule[3][black]{\textcolor{#1}{\rule{#2}{#3}}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}

\newtheorem{definition}{Definition}
\newtheorem{rem}{Remark}
\DeclareMathAlphabet{\mathcal}{OMS}{cmsy}{m}{n}

\renewcommand\theadalign{bc}
\renewcommand\theadfont{\bfseries}
\renewcommand\theadgape{\Gape[4pt]}
\renewcommand\cellgape{\Gape[4pt]}

\usepackage{adjustbox}
\usepackage{pdfpages}

\definecolor{orangeExperiment}{RGB}{204,102,0}
\definecolor{blueExperiment}{RGB}{0,102,204}

\begin{document}

\title{Can we Defend Against the Unknown? An Empirical Study About Threshold Selection for Neural Network Monitoring}

\author[1, 2]{Khoi~Tran~Dang}
\author[3]{Kevin~Delmas}
\author[2]{Jérémie~Guiochet}
\author[2, 4]{Joris~Guérin}

% Add affiliations after the authors
\affil[1]{INSA Toulouse ~~ {$^2$}LAAS-CNRS, Univ. Toulouse ~~ {$^3$}ONERA, Toulouse, France}
\affil[4]{Espace-Dev, IRD, Univ. Montpellier, Montpellier, France}
\affil[ ]{\textit{tkdang@insa-toulouse.fr ; kevin.delmas@onera.fr ;  jeremie.guiochet@laas.fr ; joris.guerin@ird.fr}}

\maketitle


%    _   _         _                  _   
%   /_\ | |__  ___| |_ _ __ __ _  ___| |_ 
%  //_\\| '_ \/ __| __| '__/ _` |/ __| __|
% /  _  | |_) \__ | |_| | | (_| | (__| |_ 
% \_/ \_|_.__/|___/\__|_|  \__,_|\___|\__|
                                        
\begin{abstract}
With the increasing use of neural networks in critical systems, runtime monitoring becomes essential to reject unsafe predictions during inference. Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions. The efficacy of these approaches is mostly evaluated using threshold-agnostic metrics, such as the area under the receiver operating characteristic curve. However, in real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions. Despite the pivotal importance of threshold optimization, this problem has received little attention. A few studies touch upon this question, but they typically assume that the runtime data distribution mirrors the training distribution, which is a strong assumption as monitors are supposed to safeguard a system against potentially unforeseen threats. In this work, we present rigorous experiments on various image datasets to investigate:
\begin{enumerate*}
    % \item Can we find a good threshold using solely in-distribution data? 
    \item The effectiveness of monitors in handling unforeseen threats, which are not available during threshold adjustments.
    \item Whether integrating generic threats into the threshold optimization scheme can enhance the robustness of monitors.
\end{enumerate*}
\end{abstract}

%   _____       _             
%   \_   \_ __ | |_ _ __ ___  
%    / /\| '_ \| __| '__/ _ \ 
% /\/ /_ | | | | |_| | | (_) |
% \____/ |_| |_|\__|_|  \___/ 
                                     
\section{Introduction}

Deep learning has gained traction in safety-critical domains such as surgical robots~\citep{Haidegger2019AutonomyFS}, autonomous vehicles~\citep{ferreira2022simood}, and drone landing~\citep{guerin2022evaluation}. As reliance on neural networks (NN) in these sectors intensifies, the importance of ensuring their safety keeps growing and demands continued research. NN runtime monitoring is a promising direction, seeking to detect unsafe predictions during inference. Numerous methods have been developed for NN runtime monitoring~\citep{hendrycks2016baseline,ferreira2023sena, wang2022vim}. They consist of designing scoring functions indicating the level of confidence for a prediction. These scores are then thresholded to reject low-confidence predictions.

\begin{figure*}[t]
    \centering
    \includegraphics[width=0.8\textwidth]{img/ICLR24summary.png}
    \caption{\textbf{Conceptual Overview} -- This research compares four ways to construct threshold optimization sets for neural network runtime monitors, each representing distinct assumptions about the data available for threshold tuning.}
    \label{fig:summary}
\end{figure*}

The performance of a monitor is assessed based on its capacity to build score distributions that effectively separate safe and unsafe predictions. To evaluate this, commonly used metrics in the literature are threshold-agnostic, representing an average performance of binary classification metrics across a range of threshold values (e.g., area under the receiver operating characteristic curve (AUROC)). High values of such metrics suggest the existence of a good threshold, but they do not ensure that it can be found easily. To deploy a monitor in a real-world application, a concrete rejection threshold value must be set to determine accepted and rejected predictions. This threshold is pivotal, as a good monitor with a poor threshold can still result in an unsafe system. Despite the crucial nature of threshold optimization, it remains under-explored in runtime monitoring research.

Building upon the foundational work of \citet{chow1970optimum}, the field of ``classification with rejection'' has considered the problem of rejection threshold optimization~\citep{geifman2017selective, zhang2023survey}. However, a strong assumption underlying most of these studies is that the data distribution encountered at runtime closely mirrors the training distribution. In practice, this means that validation is carried out on training and test datasets drawn from the same distribution. This presents a notable challenge in the context of neural network safety monitoring, where the primary objective is to safeguard critical systems against various types of threats, such as novel classes, covariate shifts, or adversarial attacks. In this paper, we aim to assess experimentally the resilience of runtime monitoring thresholds under different assumptions about our prior knowledge of runtime threats. This includes investigating scenarios that depart from the traditional assumption of distributional similarity, offering a broader coverage of diverse real-world conditions.

To address this question, we designed a rigorous large-scale experiment on computer vision datasets. We compare four different ways to construct a threshold optimization dataset (see Figure~\ref{fig:summary} and Section~\ref{sec:research_questions}), allowing us to investigate two primary research questions. First, we compare thresholds fitted with and without prior knowledge of the evaluated threat. The latter is the more realistic setting, and this comparison determines whether NN monitors can effectively handle unforeseen threats. Second, we explore the potential benefits of integrating generic threats into the threshold optimization dataset. Given the relative ease of generating or acquiring generic threats, this approach could represent a realistic and straightforward method to enhance the robustness of neural network monitors.

% Contents of next sections

This paper is organized as follows: Section~\ref{sec:related_work} reviews relevant literature on NN monitoring and threshold optimization for classification. Section~\ref{sec:research_questions} presents our methodology and the associated research questions. Section~\ref{sec:Proposed experiments} outlines our experimental design. Section~\ref{sec:Results} analyzes our findings. In Section~\ref{sec:Discussion}, we reflect on our findings and explore their practical implications. Finally, in Section~\ref{sec:conclusion}, we conclude this work and suggest future research directions. 

%    __      _       _           _                      _    
%   /__\ ___| | __ _| |_ ___  __| | __      _____  _ __| | __
%  / \/// _ | |/ _` | __/ _ \/ _` | \ \ /\ / / _ \| '__| |/ /
% / _  |  __| | (_| | ||  __| (_| |  \ V  V | (_) | |  |   < 
% \/ \_/\___|_|\__,_|\__\___|\__,_|   \_/\_/ \___/|_|  |_|\_\
                                                           
\section{Background and Related Work}
\label{sec:related_work}

In this section, we present key definitions and relevant literature about NN runtime monitoring. This work focuses on classification, but some of the methodologies discussed here are transferable to other Machine Learning tasks. 

\subsection{Neural Networks Runtime Monitors}
\label{sec:nnrm}

Let us denote a classification task by $T$, its feature space by $\mathcal{X}$, and its label space by $\mathcal{Y}$. 
The oracle function for $T$ %on $\mathcal{X}$ 
is denoted $\Omega$, signifying that the ground truth for any $x \in \mathcal{X}$ is $\Omega(x)$.
Let $D_{\text{train}}$ represent a training dataset for $T$, and let $f$ be a classifier for $T$, trained using $D_{\text{train}}$. A runtime monitor for $f$, denoted as $m_f$, is a binary classifier designed to filter out unsafe predictions of $f$. Here, we adopt a convention where the positive class for $m_f$ denotes unsafe samples, though the reverse convention also exists in the literature. 

Most of the literature on NN monitoring does not focus on constructing binary classifiers, but rather on models that output continuous scores representing the confidence in a prediction. In practice, training a monitor, i.e., adjusting the parameters of the monitoring function to generate meaningful scores, commonly involves the use of the same labeled training dataset, $D_{\text{train}}$, although this is not a strict requirement. Converting these scores into binary classification outputs requires applying a thresholding operation.
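
As an illustration, the thresholding operation can be written as follows. This is only a sketch: the symbols $r_f$ and $\tau$ are introduced here for illustration, and we assume the convention that higher scores indicate less trustworthy predictions (a confidence score $c_f$ can be mapped to this convention by taking $r_f = -c_f$):
\begin{equation*}
    m_f(x) =
    \begin{cases}
        1 \ (\text{reject}) & \text{if } r_f(x) > \tau, \\
        0 \ (\text{accept}) & \text{otherwise.}
    \end{cases}
\end{equation*}
Under this assumed convention, lowering $\tau$ makes the monitor reject more predictions, and raising it makes the monitor accept more.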

The fitting method typically relies on features extracted from one or more layers of $f$. \citet{hendrycks2016baseline} proposed to detect abnormal examples using the maximum softmax probability (MSP) as their score. \citet{lee2018simple} fitted class-conditional Gaussian distributions to the features and defined their confidence score as the minimum Mahalanobis distance to class-wise centroids. \citet{Henzinger2019OutsideTB} compared runtime features to the smallest bounding boxes containing features from $D_{\text{train}}$. \citet{liu2020energy} proposed the energy score (\emph{logsumexp} of the logits) and \citet{sun2021react} suggested computing rectified logits by clipping the activations.  Recently, \citet{wang2022vim} developed a virtual logits score, generated from the norm of feature residuals against the principal subspace defined by $D_{\text{train}}$.

\subsection{Evaluation of NN Runtime Monitors}
\label{sec:related_eval}

\subsubsection{Out-of-Distribution vs. Out-of-Model-Scope} 

The concept of \emph{safety} is central in defining the expected outcomes of runtime monitors. Two perspectives coexist to define what constitutes an unsafe sample~\citep{guerin2023outofdistribution}: 
\begin{enumerate}
    \item Out-of-Distribution (OOD): This perspective targets the detection of data points that fall beyond the training distribution of the classifier, represented by $D_{\text{train}}$.

    \item Out-of-Model-Scope (OMS): This perspective focuses on identifying data points that lead to incorrect predictions by the classifier.
\end{enumerate}

In this study, we adopt the OMS approach, where the monitor's objective is to reject misclassified samples: the ideal output is $m_f(x)=0$ when $f(x) = \Omega(x)$ (correct prediction) and $m_f(x)=1$ when $f(x) \neq \Omega(x)$ (misclassification). As explained by \citet{guerin2023outofdistribution}, the OMS setting circumvents the potentially ambiguous definition of what is OOD and avoids any misconceptions about OOD detection performance. It is important to note that the training dataset $D_{\text{train}}$, traditionally considered in-distribution in the OOD setting, often contains OMS (misclassified) samples since classifiers are rarely perfect. In summary, our study defines a good monitor as one that rejects incorrect predictions and accepts correct ones, regardless of whether the corresponding samples are considered in- or out-of-distribution.

\subsubsection{Evaluation Dataset Construction}
Even in the OMS setting, it's crucial to evaluate a monitor's performance outside the training distribution, where misclassifications are more likely. Hence, in typical evaluations of monitors, in-distribution (ID) data and out-of-distribution (OOD) threat data are used jointly to assess performance. For ID test data, we usually use the test split associated with $D_{\text{train}}$. Threat data primarily fall into three categories:
\begin{enumerate*}
    \item \emph{Novelty}: The labels do not belong to the label space ($\Omega(x) \notin \mathcal{Y}$),
    \item \emph{Covariate Shift}: The inputs are not drawn from the same distribution as $D_{\text{train}}$,
    \item \emph{Adversarial Attacks}: The inputs are maliciously modified to cause misclassifications.
\end{enumerate*}
In the OMS setting, we use labeled datasets to identify errors of $f$ to serve as ground truth for the monitor evaluation. Both the test and threat sets may contain misclassifications. Additionally, except for novelty, the threat sets can contain correct predictions, depending on the degree of perturbations.

\subsubsection{Threshold Agnostic Evaluation Metrics}
A monitor is evaluated based on its ability to distinguish correctly classified data from misclassifications. Related works frequently use threshold-agnostic metrics to assess this skill across a range of thresholds. Examples of such metrics include AUROC, AUPR (Area under the Precision-Recall curve), and FNR@95TNR (False Negative Rate at 95\% True Negative Rate). 
However, to deploy a runtime monitor in a real-world scenario, one must select a fixed threshold value to decide which predictions to reject. As of today, no studies have addressed the generic problem of threshold selection for neural network monitoring. Threshold selection is typically addressed in a somewhat nebulous manner, suggesting that the ``threshold should be chosen such that a high proportion of ID data instances are accurately processed by the monitor'' \citep{liu2020energy, sun2021react, wang2022vim}.
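
For instance, AUROC admits a simple probabilistic reading that makes its threshold-agnostic nature explicit. Assuming a rejection score $r_f$ for which higher values indicate less trustworthy predictions (an illustrative notation introduced here), and ignoring ties,
\begin{equation*}
    \mathrm{AUROC} = \Pr\left[\, r_f(x^{+}) > r_f(x^{-}) \,\right],
\end{equation*}
where $x^{+}$ is a randomly drawn sample that should be rejected and $x^{-}$ is a randomly drawn sample that should be accepted. No threshold appears in this expression: it measures how well the scores rank the two groups, not where the decision boundary should be placed.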

\subsection{Threshold optimization for classification}
\label{sec:threshold_optim_classif}

Despite the absence of work addressing threshold fitting for NN runtime monitoring, some research has tackled this problem in the broader context of classification. \citet{arampatzis2001score} explained the steps involved in the exhaustive search method for threshold optimization on a finite test dataset:
\begin{enumerate*} 
\item Calculate the classification scores for all samples of the test dataset,
\item Sort the list of predicted scores, 
\item Select a metric to represent threshold performance, called \emph{effectiveness measure},
\item Calculate the effectiveness measure at every position of the sorted list,
\item Find the position where the effectiveness measure is optimal,
\item Set the threshold slightly above this optimal position.
\end{enumerate*}
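
One possible formalization of these steps is sketched below. The notation is introduced here for illustration only: $r_{(1)} \le \dots \le r_{(n)}$ are the sorted scores of the $n$ test samples, $E(\tau)$ is the effectiveness measure obtained when samples with a score above $\tau$ are assigned to the positive class, and $\varepsilon > 0$ is an offset assumed to be smaller than the smallest gap between distinct scores:
\begin{align*}
    k^{\star} &= \operatorname*{arg\,max}_{k \in \{1, \dots, n\}} E\left(r_{(k)} + \varepsilon\right), \\
    \tau^{\star} &= r_{(k^{\star})} + \varepsilon.
\end{align*}
Since $E$ can only change when $\tau$ crosses one of the observed scores, only a finite number of candidate positions needs to be examined.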

In the literature, the most common variations of this standard optimization pipeline involve alternative choices for the effectiveness measure: % adapted to a specific application context: 
F-score \citep{ZOU20162}, geometric mean of Recall and Specificity \citep{GmeanF1}, Matthews correlation coefficient \citep{MCC}, or Cohen's kappa \citep{FREEMAN200848}. 
Another research direction involves developing optimized search strategies to identify the threshold more efficiently \citep{arampatzis2001score, ghost}.
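
For reference, these effectiveness measures can all be expressed from the confusion counts obtained at a given threshold. The formulas below are the standard textbook definitions rather than quotations from the cited works, with $\mathrm{TP}$, $\mathrm{FP}$, $\mathrm{TN}$, and $\mathrm{FN}$ denoting the numbers of true positives, false positives, true negatives, and false negatives, and $n$ the total number of samples:
\begin{align*}
    \text{Recall} &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad
    \text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \\
    \text{Specificity} &= \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}, \\
    F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}, \\
    \text{g-mean} &= \sqrt{\text{Recall} \cdot \text{Specificity}}, \\
    \mathrm{MCC} &= \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{P_{+} \, A_{+} \, P_{-} \, A_{-}}}, \\
    \kappa &= \frac{p_o - p_e}{1 - p_e}, \quad
    p_o = \frac{\mathrm{TP}+\mathrm{TN}}{n}, \quad
    p_e = \frac{P_{+} A_{+} + P_{-} A_{-}}{n^2},
\end{align*}
where $P_{+} = \mathrm{TP}+\mathrm{FP}$ and $P_{-} = \mathrm{TN}+\mathrm{FN}$ are the numbers of predicted positives and negatives, and $A_{+} = \mathrm{TP}+\mathrm{FN}$ and $A_{-} = \mathrm{TN}+\mathrm{FP}$ are the numbers of actual positives and negatives.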

In this study, we compare four ways to construct the validation set used to optimize the threshold for runtime monitors.

%    __   ____ 
%   /__\ /___ \
%  / \////  / /
% / _  / \_/ / 
% \/ \_\___,_\ 
                                         
\section{Methodology}
\label{sec:research_questions}

Let us consider a monitor that has been trained to produce scores reflecting the confidence of an NN. %, using $D_{\text{train}}$. 
Our goal is to compare different ways to build a validation dataset on which we can find an optimal threshold for these scores, to determine the predictions to reject. Although the process of finding a suitable threshold has received little attention in the literature, it is a crucial factor to consider. In practice, a monitor may generate scores that accurately distinguish incorrect predictions, but its safety could be compromised if the rejection threshold is not properly calibrated.

To evaluate the effectiveness of a given threshold, we employ conventional binary classification metrics, such as Recall and Precision, on a carefully designed test dataset, which we call \emph{Threshold Evaluation Set}. To have a balanced evaluation, we construct the Threshold Evaluation Set to encompass regular in-distribution data as well as one specific target threat. The inclusion of in-distribution data enables us to identify monitors that may overly reject, and focusing on a single threat allows us to characterize distinct monitor failures. This focus is more realistic, as it is unlikely for a NN to encounter multiple threats concurrently. We emphasize that our experiments address multiple threats in practice, but they are assessed separately to evaluate monitor performance across different threat scenarios.

To tune the threshold, we use a separate \emph{Threshold Optimization Set}. Fitting the threshold essentially involves identifying the value that optimizes a specific effectiveness measure on the Optimization Set (see Section~\ref{sec:threshold_optim_classif}). The chosen effectiveness measure should reflect the delicate balance between system safety and availability, i.e., it should encapsulate the monitor's capacity to reject incorrect predictions and to accept correct ones \citep{ISSRE22_guerin}. In our experiments, we try F1 and g-mean (see Section~\ref{sec:Proposed experiments}). Both the Threshold Optimization and Evaluation sets are composed of inputs to the NN (images), corresponding monitor scores, and labels that indicate the correctness of the predictions.

Our experiments compare four ways to construct the Optimization Set (Figure~\ref{fig:summary}). They reflect alternative real-world deployment scenarios for monitors, representing assumptions about our ability to anticipate forthcoming threats:
\begin{enumerate}
     
\item The first assumption, denoted ID, involves constructing an optimization set composed exclusively of In-Distribution (ID) data samples. This presumes that no threat data is accessible for threshold optimization. In the remaining approaches, ID samples are still present, along with other samples corresponding to threats. 

\item The second approach, denoted ID+T, involves enriching the optimization set with data samples associated with the Target threat (T), i.e., the threat under evaluation. This scenario corresponds to situations where threats pertinent to the system have been previously identified, such as through a system safety analysis.

\item The third approach, denoted ID+O, designs an optimization set without the target threat, but including samples corresponding to Other generic threats (O). This scenario examines if awareness of generic threats can aid in determining a more effective threshold for unanticipated, new threats.

\item The fourth approach, denoted ID+T+O, employs an optimization set containing data samples for both the Target and Other generic threats. It aims to assess the performance of a monitoring threshold when multiple threats are used and one of them is the target threat.
\end{enumerate}
A summary of how the Optimization and Evaluation Sets are constructed for the different approaches can be found in Table~\ref{tab:thresholdoptimizationset}. It shows that the Evaluation Set is always the same and never overlaps with the Optimization Set. 
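
In set notation, with symbols introduced here only for illustration, let $\mathrm{ID}_1, \mathrm{ID}_2$ and $\mathrm{T}_1, \mathrm{T}_2$ denote the two non-overlapping splits of the in-distribution and target threat data, and let $\mathrm{O}$ denote the pool of other generic threats. The four optimization sets and the evaluation set can then be written as:
\begin{align*}
    D_{\text{opt}}^{\text{ID}} &= \mathrm{ID}_1, \\
    D_{\text{opt}}^{\text{ID+T}} &= \mathrm{ID}_1 \cup \mathrm{T}_1, \\
    D_{\text{opt}}^{\text{ID+O}} &= \mathrm{ID}_1 \cup \mathrm{O}, \\
    D_{\text{opt}}^{\text{ID+T+O}} &= \mathrm{ID}_1 \cup \mathrm{T}_1 \cup \mathrm{O}, \\
    D_{\text{eval}} &= \mathrm{ID}_2 \cup \mathrm{T}_2.
\end{align*}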

\begin{table*}[t]
    \centering
    \caption{\textbf{Threshold Optimization and Evaluation sets} -- Methodology to construct the threshold optimization and evaluation sets for the different strategies considered in this study. Set~1 and Set~2 always denote non-overlapping splits of a dataset.}
    \label{tab:thresholdoptimizationset}
    \begin{tabular}{cc|C{35pt}|C{35pt}|C{35pt}|C{35pt}|c}
        \multicolumn{2}{c}{\multirow{2}{*}{}} &\multicolumn{2}{c|}{In-Distribution} & \multicolumn{2}{c|}{Target Threat} & Other Generic \\%\cline{3-6}
        
        \multicolumn{2}{c}{} & Set 1 & Set 2 & Set 1 & Set 2 & Threats\\ 
        \hline 
        
        \multirow{4}{*}{\shortstack{Threshold\\ Optimization}}
        & ID & \checkmark & & & &\\ \cline{2-7}
        & ID+T & \checkmark & & \checkmark & &\\ \cline{2-7}
        & ID+O & \checkmark & & & & \checkmark\\ \cline{2-7}
        & ID+T+O & \checkmark & & \checkmark & & \checkmark\\
        \hline
        \multicolumn{2}{c|}{Threshold Evaluation} & & \checkmark & & \checkmark & \\
    \end{tabular}
\end{table*}

The objective of comparing these four approaches is two-fold. First, we aim to assess the effectiveness of monitors when the target threat T is unknown, which reflects a more realistic scenario. This evaluation helps us understand if monitors, as evaluated in previous literature using threshold-agnostic metrics or optimization sets mirroring the training distribution, can be relied upon in real-world situations to protect systems from unknown threats. Our comparison aims to determine whether experiments from previous works are sufficient to draw conclusions about a monitor's real-world performance or if additional tests are needed before deployment. As a result, we formulate our first research question as: \emph{RQ1 -- Can we obtain similar monitoring performance without assuming prior knowledge of runtime threats during threshold tuning?}

To answer RQ1, we compare ID against ID+T, and ID+O against ID+T+O. If our findings reveal that prior awareness about the evaluated threat is crucial, it could significantly limit the applicability of runtime monitors. Indeed, the main objective of monitoring is to address unforeseen hazards. If knowledge about the actual threats that an NN will encounter is readily available, such examples would typically be incorporated during training. It is worth noting that several studies have used this strategy for tuning monitor hyperparameters by simply dividing the evaluation set into validation and test subsets \citep{hsu2020generalized}.

The second objective is to evaluate whether the strategy of adding a pool of generic threat data to tune the threshold can be viable to increase the robustness of the monitor to unforeseen threats. Such generic threats are easy to obtain by collecting additional image data from the internet or adding perturbations to ID data. On the one hand, adding such generic threat data can help generalization by adding difficult examples to better delineate the boundaries of what the NN knows. On the other, it could also be detrimental if the selected generic threats are too diverse. Hence, our second objective is to answer the following research question: \emph{RQ2 -- How helpful is the inclusion of generic threats data?}

For RQ2, we compare strategy ID against ID+O, as well as ID+T against ID+T+O. A positive answer would be promising, given the relative ease of constructing a generic dataset of threats, which could be utilized to enhance monitoring system performance and subsequently facilitate the adoption of neural networks in safety-critical systems.

%    __                      ___         _             
%   /____  ___ __   ___     /   \___ ___(_) __ _ _ __  
%  /_\ \ \/ | '_ \ / _ \   / /\ / _ / __| |/ _` | '_ \ 
% //__  >  <| |_) |  __/  / /_/|  __\__ | | (_| | | | |
% \__/ /_/\_| .__/ \___| /___,' \___|___|_|\__, |_| |_|
%           |_|                            |___/       
          
\section{Experimental Design}
\label{sec:Proposed experiments}

\subsection{Datasets, Models and Monitors}
To answer the aforementioned research questions, we conducted extensive experiments. To encapsulate varying ID scenarios, we use three image classification datasets: CIFAR10, CIFAR100~\citep{cifar10-100} and SVHN~\citep{svhn}. For each ID dataset, we use two distinct neural network architectures, DenseNet and ResNet, with weights taken from \citet{lee2018simple}. For DenseNet, the test accuracies are: CIFAR10 (0.93), CIFAR100 (0.73), SVHN (0.88), and for ResNet: CIFAR10 (0.92), CIFAR100 (0.73), SVHN (0.89).

For each ID dataset and architecture pair, we implement four distinct monitoring techniques. Mahalanobis (Maha) \citep{lee2018simple} and Outside-the-Box (OtB) \citep{Henzinger2019OutsideTB} are feature-based approaches. We derive the feature representation from the final layer preceding classification and do not apply input pre-processing. On the other hand, Max Softmax Probability (MSP) \citep{hendrycks2016baseline} and Energy (Ene) \citep{liu2020energy} are logit-based methods. Regarding hyperparameters, we use num\_box=3 for OtB and T=1 for Ene. These settings result in a total of 24 monitors evaluated (3 ID datasets $\times$ 2 NNs $\times$ 4 monitors).

Each ID set is paired with nine unique threat sets to assess the monitors under varied circumstances:
\begin{itemize}
    \item 3 novelty sets (datasets with classes distinct from the ID set). For CIFAR10, the novelty sets are CIFAR100, SVHN, and LSUN \citep{yu2015lsun}. For CIFAR100, they are CIFAR10, SVHN, and LSUN. For SVHN, they are CIFAR10, LSUN, and TinyImageNet (a subset of ImageNet \citep{deng2009imagenet}).
    \item 3 covariate shifts (transformations from AugLy \citep{papakipos2022augly}), including Brightness (factor=3), Pixelization (ratio=0.5) and Blur (radius=2).
    \item 3 adversarial attacks (generated with Torchattacks \citep{kim2020torchattacks}): FGSM, PGD, and DeepFool, using the default settings.
\end{itemize}


\subsection{Threshold Optimization Methodology}

For each of the 24 monitors, we cycle through the 9 threats, with each serving once as the Target threat (T), resulting in 9 unique outcomes for each optimization set construction approach. While assessing a target threat T, the remaining 8 threats serve as Other Generic Threats (O). The test split of the classifier's training dataset serves as the In-Distribution (ID) dataset. Then, both the ID set and the T set are randomly split in half, so that the threshold evaluation set and the four threshold optimization sets can be constructed, following the methodology presented in Section~\ref{sec:research_questions} (Table~\ref{tab:thresholdoptimizationset}). 
 
To optimize the threshold on the optimization set, we follow the methodology described in Section~\ref{sec:threshold_optim_classif}. For the effectiveness measure, we initially used F1, the harmonic mean of Precision and Recall, as it is a prevalent choice in the literature. Yet, early experiments revealed that F1 frequently led to exceedingly low thresholds, which reject all samples in the evaluation set. Of the 864 experiments conducted (24 monitors $\times$ 9 threats $\times$ 4 optimization sets), this outcome happened 116 times. Such behavior can be attributed to the significant class imbalance often observed in our optimization sets. Indeed, since classifiers typically commit fewer errors on ID data, the ID strategy predominantly contains negative examples (designated for acceptance), whereas other strategies, notably ID+O and ID+T+O, contain many more positive examples (designated for rejection).

We tested two distinct solutions to address this challenge:
\begin{enumerate}
\item over-sampling (OS) the minority class in the threshold optimization set to achieve a positive-to-negative ratio between 0.4 and 0.6,
\item using another effectiveness measure: g-mean, the geometric mean between Recall and Specificity. As Specificity solely considers samples with negative ground truth, g-mean is unaffected by class imbalance. A very low threshold results in a recall of 1 and a specificity of 0, and will not be favored by g-mean optimization.
\end{enumerate}
Both OS+F1 and g-mean approaches are compared in our experiments. 
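
A short calculation, given here as an illustrative sketch, shows why F1 can favor degenerate thresholds while g-mean cannot. Let $\pi$ denote the fraction of positive samples in the optimization set (a symbol introduced here for illustration). A threshold that rejects every sample yields $\text{Recall} = 1$, $\text{Precision} = \pi$, and $\text{Specificity} = 0$, hence
\begin{equation*}
    F_1 = \frac{2\pi}{1+\pi}, \qquad \text{g-mean} = \sqrt{1 \times 0} = 0.
\end{equation*}
For example, with a hypothetical $\pi = 0.9$, rejecting everything already reaches $F_1 \approx 0.95$, which is difficult to beat, whereas with $\pi = 0.5$ it only reaches $F_1 \approx 0.67$. This is consistent with the motivation for re-balancing the optimization set before optimizing F1.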

\subsection{Evaluation Metrics and Statistical Synthesis}

Once a threshold is chosen, we evaluate its performance on the threshold evaluation set. For each experiment, we compute five evaluation metrics (F1, g-mean, Recall, Precision, Specificity) representing different aspects of the monitor's performance. Computing these diverse metrics allows us to analyze the impact of different threshold optimization approaches more finely.

Given the comprehensive scope of our experiments, we are left with 1728 recorded outcomes for each of these five metrics. Drawing definitive conclusions from such an expansive set of raw results is challenging. Even when we fix the effectiveness measure, we are still tasked with comparing the four threshold optimization approaches across 216 cases. Consequently, we resort to statistical testing to discern the distinctions between approaches across multiple results. Adhering to the methodology outlined by \citet{StatisticalComparison}, we employ the non-parametric Wilcoxon signed-rank test for comparing two strategies over multiple scenarios (``no difference'' null hypothesis, $p < 0.05$ for significance). The Friedman test and its associated Nemenyi post-hoc test are used for comparing multiple strategies across multiple scenarios.
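
The experiment counts used in this section follow from the experimental design. The last line assumes that the factor of two corresponds to the two retained effectiveness measures (OS+F1 and g-mean):
\begin{align*}
    24 \text{ monitors} \times 9 \text{ threats} &= 216 \text{ cases}, \\
    216 \times 4 \text{ optimization sets} &= 864, \\
    864 \times 2 \text{ effectiveness measures} &= 1728.
\end{align*}
By the same arithmetic, the 116 degenerate thresholds observed with plain F1 represent $116 / 864 \approx 13.4\%$ of the corresponding experiments.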

%    __                _ _       
%   /__\ ___ ___ _   _| | |_ ___ 
%  / \/// _ / __| | | | | __/ __|
% / _  |  __\__ | |_| | | |_\__ \
% \/ \_/\___|___/\__,_|_|\__|___/
                               
\section{Results}
\label{sec:Results}

The data from our 1728 experiments is complex and not immediately interpretable in its raw form. In this section, we present the outcomes of our statistical analysis and draw associated conclusions. For transparency and reproducibility, the raw results, as well as the code to replicate our experiments have been made available.\footnote{\url{https://github.com/jorisguerin/neural-network-monitoring-benchmark}}

\subsection{Comparing effectiveness measures}

First, we compare the two proposed effectiveness measures for threshold tuning on the Optimization Set: over-sampling with F1 (OS+F1) and g-mean. For each of the 4 strategies and each of the 5 evaluation metrics, we compare these effectiveness measures using the Wilcoxon signed-rank test across the 216 experiments. The Wilcoxon test is a non-parametric statistical test used to compare the performance of two classifiers over multiple datasets \citep{StatisticalComparison}. The results obtained are shown in Table~\ref{tab:f1ups_vs_gmean_4monitors}. We find that OS+F1 generally yields better Recall and F1 scores, whereas g-mean optimization produces better Precision, Specificity, and g-mean scores. These findings indicate that the choice of the effectiveness measure should be based on the particular metric one seeks to optimize, and this choice should be aligned with the objectives of the system under test. A higher Recall corresponds to a more conservative system, i.e., fewer false acceptances from the monitor. Conversely, higher Precision and Specificity indicate an improved system availability, i.e., fewer false rejections from the monitor. More results comparing effectiveness measures can be found in Appendix~\ref{app:effectiveness_measures}.
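
For reference, the Wilcoxon signed-rank statistic can be summarized as follows. This is the standard textbook form, with notation introduced here for illustration. Let $d_i$ be the difference between the two effectiveness measures on experiment $i$, for $i = 1, \dots, N$, and let $\operatorname{rank}(|d_i|)$ be the rank of its absolute value. Ignoring zero differences for simplicity,
\begin{align*}
    R^{+} &= \sum_{d_i > 0} \operatorname{rank}(|d_i|), \qquad
    R^{-} = \sum_{d_i < 0} \operatorname{rank}(|d_i|), \\
    T &= \min\left(R^{+}, R^{-}\right), \\
    z &= \frac{T - \frac{1}{4} N (N+1)}{\sqrt{\frac{1}{24} N (N+1)(2N+1)}},
\end{align*}
where $z$ is approximately standard normal under the null hypothesis for large $N$. With $N = 216$, the null mean of $T$ is $11718$ and its standard deviation is approximately $919.6$.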

\begin{table}[t]
    \centering
    \caption{\textbf{Effectiveness measures comparison (OS+F1 vs. g-mean)} -- Metrics were computed across the 216 experiments, followed by statistical comparison using the Wilcoxon signed-rank test. The displayed numbers are p-values: underlined orange text indicates that OS+F1 is worse than g-mean, regular blue text indicates that OS+F1 is better than g-mean, and italicized black text indicates no significant difference.}
    \begin{tabular}{c|C{30pt}C{30pt}C{30pt}C{40pt}}
        & \rotatebox[origin=c]{0}{ID} & \rotatebox[origin=c]{0}{ID+T} & \rotatebox[origin=c]{0}{ID+O} & \rotatebox[origin=c]{0}{ID+T+O} \\
\hline
        F1 &  \textcolor{orangeExperiment}{\underline{3e-08}} & \textcolor{blueExperiment}{3e-04} &  \textcolor{blueExperiment}{2e-04} &  \textcolor{blueExperiment}{8e-06} \\

        G-mean &  \textcolor{orangeExperiment}{\underline{1e-26}} &  \textcolor{orangeExperiment}{\underline{4e-29}} &  \textcolor{orangeExperiment}{\underline{2e-02}} &  
        \textit{5e-02} \\

        Recall &  \textcolor{blueExperiment}{4e-31} &  \textcolor{blueExperiment}{2e-32} &  \textcolor{blueExperiment}{4e-37} &  \textcolor{blueExperiment}{4e-37}\\

        Precision &  \textcolor{orangeExperiment}{\underline{1e-31}} &  \textcolor{orangeExperiment}{\underline{1e-32}} &  \textcolor{orangeExperiment}{\underline{9e-36}} &  \textcolor{orangeExperiment}{\underline{2e-35}}  \\

        Specificity &  \textcolor{orangeExperiment}{\underline{1e-31}} &  \textcolor{orangeExperiment}{\underline{2e-32}} &  \textcolor{orangeExperiment}{\underline{3e-37}} &  \textcolor{orangeExperiment}{\underline{3e-37}} \\
    \end{tabular}

    \label{tab:f1ups_vs_gmean_4monitors}
\end{table}

\subsection{Comparing Threshold Optimization set construction approaches}

Next, we compare the monitoring performance obtained with the different approaches to construct the Threshold Optimization set. To compare several approaches across experiments, we use the Friedman test and its corresponding Nemenyi post-hoc test, as recommended by \citet{StatisticalComparison}. The Friedman test is a non-parametric test that compares the average ranks of the different approaches, under the null hypothesis that there is no significant difference between them. If the null hypothesis is rejected, the Nemenyi post-hoc test is then applied to identify which approaches differ significantly from one another.
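
For reference, both tests can be sketched in their usual form. The symbols $k$, $N$, $R_j$, $\chi^2_F$, $q_\alpha$, and $CD$ are introduced here for illustration only. Let $k$ be the number of approaches, $N$ the number of experiments, and $R_j$ the average rank of approach $j$ over the $N$ experiments. The Friedman statistic and the Nemenyi critical distance are
\begin{align*}
\chi^2_F &= \frac{12 N}{k (k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k (k+1)^2}{4} \right], \\
CD &= q_\alpha \sqrt{\frac{k (k+1)}{6 N}},
\end{align*}
where $\chi^2_F$ approximately follows a $\chi^2$ distribution with $k-1$ degrees of freedom under the null hypothesis, and $q_\alpha$ is a critical value derived from the Studentized range statistic. Two approaches are declared significantly different when their average ranks differ by at least $CD$. As an indication only, assuming $k = 4$ and $N = 216$, as the experimental setup suggests, and taking the tabulated value $q_{0.05} \approx 2.569$ for four approaches, this would give $CD \approx 0.32$; the value actually used to draw Figure~\ref{fig:nemenyi_allmonitors} may differ.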

More precisely, we compare the values obtained for the F1 and g-mean scores on the Threshold Evaluation sets. We focus on these metrics because they are both intended to represent a balance between over-rejection and over-acceptance. At the significance level of $\alpha = 0.05$, the Friedman test shows a significant difference in performance between the four threshold optimization approaches. The results of the Nemenyi test, with both OS+F1 and g-mean as effectiveness measures, are presented in Figure~\ref{fig:nemenyi_allmonitors}. These results allow us to formulate explicit responses to our research questions. We note that results for other evaluation metrics (Recall, Precision, Specificity) are given in Appendix~\ref{app:strategies}.

\paragraph{RQ1 -- Can we obtain similar monitoring performance without assuming prior knowledge of runtime threats during threshold tuning?} 
As anticipated, the best strategy is ID+T, where the Optimization set closely mirrors the Evaluation set. Interestingly, the ID+T+O and ID+O strategies consistently demonstrate statistically equivalent performance. This suggests that, if one opts for a large set of generic threats for threshold tuning, including target threat data brings no additional benefit. This is because the target threat samples in the Threshold Optimization set are diluted among the other threats, which diminishes their influence on the selected threshold.

\paragraph{RQ2 -- How helpful is the inclusion of generic threat data?} 
With OS+F1 as the effectiveness measure, the ID+O strategy outperforms ID. Conversely, with g-mean as the effectiveness measure, ID outperforms ID+O. Hence, to assess the benefits of incorporating other generic threats more precisely, we perform a Wilcoxon test comparing the ID strategy optimized with g-mean to the ID+O strategy optimized with OS+F1. Our results reveal that ID is superior to ID+O when evaluating g-mean scores on the Evaluation sets (p-value=3e-10), and that there is no statistical difference between the two strategies for the F1 evaluation metric (p-value=0.2). In other words, without knowledge about the threats a system might face, it is preferable to rely solely on in-distribution data to determine the monitoring threshold and to use g-mean for optimization.

Figure~\ref{fig:nemenyi_allmonitors} also indicates that ID+T is better than ID+T+O. 
This suggests that supplementing the Threshold Optimization set with an arbitrary pool of threat data is not beneficial. If the target threat has been identified, it is advisable to use only ID data combined with data from that specific threat, since adding data from unrelated threats simply penalizes the monitor. However, it is worth noting that incorporating threats from more narrowly defined categories, closely aligned with the expected system threat, might offer improved generalization and could be explored in future research.

\begin{figure}[t]
\centering
\subfloat[F1 (effectiveness measure: OS+F1)]{\label{sfig:nemenyi_a}\includegraphics[width=.45\textwidth]{img/CDdiagramnemenyi_f1-score_evaluation_optimizef1.png}}\hfill
\subfloat[g-mean (effectiveness measure: OS+F1)]{\label{sfig:nemenyi_b}\includegraphics[width=.45\textwidth]{img/CDdiagramnemenyi_gmean_evaluation_optimizef1.png}}\hfill
\subfloat[F1 (effectiveness measure: g-mean)]{\label{sfig:nemenyi_c}\includegraphics[width=.45\textwidth]{img/CDdiagramnemenyi_f1-score_evaluation_optimizegmean.png}}\hfill
\subfloat[g-mean (effectiveness measure: g-mean)]{\label{sfig:nemenyi_d}\includegraphics[width=.45\textwidth]{img/CDdiagramnemenyi_gmean_evaluation_optimizegmean.png}}\hfill

\caption{\textbf{Optimization sets comparison} -- Critical distance diagram (Nemenyi test). The horizontal axis represents the average rank of the strategies. A black bar connecting two or more strategies indicates no significant difference.}
\label{fig:nemenyi_allmonitors}

\end{figure}

\section{Qualitative discussion}
\label{sec:Discussion}

As anticipated, tuning the threshold with data closely mirroring the evaluation dataset (ID+T) yielded the best results. However, the decreased performance observed when adding generic threats to the Optimization set is less intuitive. In this section, we examine this behavior through an example.

To ensure that the chosen example offers meaningful insights, we select a case where the performance differences across strategies align with the conclusions presented above. For clear visualization, we require a monitor that exhibits good separability (AUROC $>$ 0.8 on the Evaluation set), and we select the scenario that shows the maximum performance variability among strategies. Details about the chosen example can be found in Appendix~\ref{app:extreme}. Figure~\ref{fig:extremeCase} shows the distributions of monitoring scores of the Threshold Optimization sets for the ID, ID+O, and ID+T strategies, as well as for the Threshold Evaluation set. The thresholds derived from both effectiveness measures are also displayed. 
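
To clarify what each vertical line in the figure represents, threshold selection can be sketched as follows. The notation is an assumption introduced for illustration: $s(x)$ is the monitoring score of input $x$, the monitor is assumed to reject $x$ when $s(x) > \tau$ (consistent with lower thresholds producing more rejections in this discussion), $M$ is the chosen effectiveness measure (F1 after over-sampling, or g-mean), and $\mathcal{D}_{\mathrm{opt}}$ and $\mathcal{D}_{\mathrm{eval}}$ denote the Threshold Optimization and Threshold Evaluation sets.
\begin{align*}
\hat{\tau} &= \operatorname*{arg\,max}_{\tau} \; M\left(\tau ; \mathcal{D}_{\mathrm{opt}}\right), &
\tau^{\star} &= \operatorname*{arg\,max}_{\tau} \; M\left(\tau ; \mathcal{D}_{\mathrm{eval}}\right).
\end{align*}
Here $\hat{\tau}$ is the threshold obtained with a given strategy, and $\tau^{\star}$ corresponds to the ``Optimal'' threshold, which is only available in hindsight. A strategy performs well when the score distributions in $\mathcal{D}_{\mathrm{opt}}$ resemble those in $\mathcal{D}_{\mathrm{eval}}$, so that $\hat{\tau}$ lies close to $\tau^{\star}$.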

\begin{figure*} [t]
\centering
\subfloat[Threshold Optimization set (ID)]{\label{sfig:a}\includegraphics[width=.45\textwidth, height=0.23\textwidth]{img/Distribution_optimizationset_extreme_id_MahalanobisMonitor_cifar10_cifar10_fgsm_None_resnet.png}}\hfill
\subfloat[Threshold Optimization set (ID+T)]{\label{sfig:b}\includegraphics[width=.45\textwidth, height=0.23\textwidth]{img/Distribution_optimizationset_extreme_id-t_MahalanobisMonitor_cifar10_cifar10_fgsm_None_resnet.png}}\hfill
\subfloat[Threshold Optimization set (ID+O)]{\label{sfig:c}\includegraphics[width=.45\textwidth, height=0.23\textwidth]{img/Distribution_optimizationset_extreme_id-0_MahalanobisMonitor_cifar10_cifar10_fgsm_None_resnet.png}}\hfill
\subfloat[Threshold Evaluation set]{\label{sfig:d}\includegraphics[width=.45\textwidth, height=0.23\textwidth]{img/Distribution_evalset_extreme_MahalanobisMonitor_cifar10_cifar10_fgsm_None_resnet.png}}

\caption{\textbf{Visual example to explain our findings} -- Distributions of monitoring scores for the Optimization and Evaluation sets. Selected example: ID data: CIFAR10, threat: FGSM, NN: ResNet, monitor: Mahalanobis. Vertical lines represent thresholds obtained with different effectiveness measures. In (d), the dashed (resp. solid) lines represent thresholds obtained with OS+F1 (resp. g-mean). The ``Optimal'' thresholds maximize the effectiveness measures on the Evaluation set. \vspace{-10pt}}
\label{fig:extremeCase}
\end{figure*}

Examining Figures~\ref{sfig:b} and \ref{sfig:d}, we observe that the ID+T strategy yields score distributions most resembling those in the Evaluation set, leading to near-optimal thresholds, especially when using g-mean. In contrast, the ID strategy (Figure~\ref{sfig:a}) shows error scores (in blue) that are too close to the correct ones, resulting in smaller thresholds. However, it is worth noting that the ID strategy performs particularly well for this example, likely due to FGSM attacks generating images closely resembling the originals.

Figure~\ref{sfig:c} shows the limitations of ID+O. Interestingly, the failure mode depends on the effectiveness measure used. With OS+F1, the threshold is too small because the error score distribution stretches excessively to the left: as F1 tries to minimize missed errors, i.e., maximize Recall, it pushes for a smaller threshold. Conversely, with g-mean, the threshold is too high because the correct score distribution stretches excessively to the right, and g-mean optimization prioritizes reducing false rejections to maintain Specificity. The wide spread in ID+O scores can be attributed to the large variety of threat data, which ranges from correctly classified data deviating from the training distribution to imperceptible threats triggering errors.

%    ___                 _           _             
%   / __\___  _ __   ___| |_   _ ___(_) ___  _ __  
%  / /  / _ \| '_ \ / __| | | | / __| |/ _ \| '_ \ 
% / /__| (_) | | | | (__| | |_| \__ | | (_) | | | |
% \____/\___/|_| |_|\___|_|\__,_|___|_|\___/|_| |_|

\section{Conclusion}%and future works
\label{sec:conclusion}

\vspace{-15pt}
In this study, we conducted a comprehensive experimental comparison of different ways to build threshold optimization datasets for NN runtime monitoring. Our findings shed light on the effectiveness of these approaches and on their implications for real-world applications.

Our research confirmed that the ID+T approach, which leverages knowledge of the anticipated system threat to set monitor thresholds, outperforms all other approaches. However, assuming prior knowledge of the threat is impractical for safety-critical applications, where monitors are typically designed to safeguard systems against unforeseen threats. Our findings show that comparable monitoring results cannot be expected without such prior knowledge. This potentially casts doubt on the representativeness of prior evaluation results, which either relied on threshold-agnostic metrics or tuned thresholds on data similar to the test set, thereby assuming such prior knowledge.

We also investigated the inclusion of generic threat data in the threshold optimization process. Surprisingly, our experiments revealed that this approach can actually compromise monitor performance. The example discussed in Section~\ref{sec:Discussion} suggests that incorporating
data samples from unrelated threats results in overly dispersed distributions of correct and error scores, leading to suboptimal thresholds. This points to a promising avenue for future research: exploring the integration of data samples from more narrowly defined threat categories. This approach could facilitate the design of monitors tailored to specific classes of anticipated threats, such as adversarial attacks. However, its success requires a rigorous safety analysis of the system to identify relevant threats and customize optimization sets accordingly.

Furthermore, we examined the choice of effectiveness measures for selecting thresholds on the optimization set. Our findings highlight that the appropriate effectiveness measure depends on the specific objectives of the monitor. F1 with over-sampling yields conservative monitors that reduce missed errors, while g-mean encourages higher system availability by reducing false rejections.

Our study offers a versatile experimental methodology that can be adapted to explore several other interesting questions.
First, many studies split the evaluation dataset into validation and test sets for parameter optimization, which is equivalent to employing the ID+T approach. Our framework could therefore provide deeper insights into how much monitoring techniques rely on target threat knowledge for hyperparameter tuning.
We also aim to extend these results to other tasks, such as object detection, to formulate more comprehensive and universally applicable guidelines for crafting robust neural network monitoring systems. Finally, it would be interesting to investigate whether different families of threats react differently to the proposed strategies.

\newpage
\bibliography{uai2024_conference}
% \bibliographystyle{plainnat}
\newpage
\onecolumn


\title{Can we Defend Against the Unknown? An Empirical Study About Threshold Selection for Neural Network Monitoring\\(Supplementary Material)}

\maketitle

\input{appendix}

\end{document}
