\appendix
\label{Appendix}


\section{Further comparisons of effectiveness measures}
\label{app:effectiveness_measures}

In Section~\ref{sec:Results}, we compared two effectiveness measures for threshold tuning: g-mean (the geometric mean of Recall and Specificity) and F1 (the harmonic mean of Precision and Recall) combined with oversampling (OS+F1). Here, we extend this analysis to F1 without oversampling and to a threshold commonly used in the literature, chosen such that the True Negative Rate on the Threshold Optimization set equals 0.95. We use the same comparison protocol, employing the Wilcoxon signed-rank test to determine which effectiveness measure yields better performance on each of the five evaluation metrics. Performance is always evaluated on the corresponding Threshold Evaluation sets.
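
For reference, the evaluation metrics and the two effectiveness measures can be written in terms of confusion-matrix counts. The notation below is ours and assumes that the positive class corresponds to inputs that the monitor should reject, so that $TP$, $FP$, $TN$, and $FN$ are counted with respect to rejection:
\[
\mbox{Recall} = \frac{TP}{TP + FN}, \qquad
\mbox{Precision} = \frac{TP}{TP + FP}, \qquad
\mbox{Specificity} = \frac{TN}{TN + FP},
\]
\[
\mbox{g-mean} = \sqrt{\mbox{Recall} \times \mbox{Specificity}}, \qquad
\mbox{F1} = \frac{2 \times \mbox{Precision} \times \mbox{Recall}}{\mbox{Precision} + \mbox{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}.
\]
Under these definitions, F1 does not depend on $TN$, whereas g-mean weights the two classes symmetrically through Recall and Specificity.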

To confirm that oversampling is beneficial, we compare F1 with and without oversampling as effectiveness measures. Table~\ref{tab:f1ups_vs_f1noups_allmonitor} shows the results. In most settings, oversampling gives better F1 and g-mean scores on the Threshold Evaluation set; the exceptions are F1 for ID+T, where oversampling is significantly worse, and the ID setting, where the differences in F1 and g-mean are not significant. These results suggest that the oversampling strategy is effective and mitigates the degenerate behavior of rejecting all inputs in imbalanced scenarios (see Section~\ref{sec:Proposed experiments}).
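
To illustrate why class imbalance can favor the reject-all behavior, consider a hypothetical optimization set in which a fraction $\pi$ of the samples are positives, i.e., inputs that should be rejected ($\pi$ is our notation for this sketch, not a quantity reported in the experiments). A monitor that rejects every input has Recall $= 1$ and Precision $= \pi$, so that
\[
\mbox{F1}_{\mbox{\scriptsize reject-all}} = \frac{2 \times \pi \times 1}{\pi + 1} = \frac{2\pi}{1 + \pi}.
\]
For instance, $\pi = 0.9$ gives $\mbox{F1} \approx 0.947$, whereas a balanced set ($\pi = 0.5$) gives $\mbox{F1} = 2/3 \approx 0.667$. Assuming that oversampling brings the optimization set closer to balance, the trivial solution becomes less attractive relative to a threshold that actually separates the two classes.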

\begin{table}[htbp]
    \centering
    \caption{\textbf{Effectiveness measures comparison (F1 with oversampling vs. F1 without oversampling)} -- Metrics were computed across the 216 experiments, followed by statistical comparison using the Wilcoxon signed-rank test. The displayed numbers are p-values: underlined orange text indicates that F1 with oversampling is worse than F1 without oversampling, regular blue text indicates that F1 with oversampling is better than F1 without oversampling, and italicized black text indicates no significant difference.}
    \begin{tabular}{c|C{50pt}C{50pt}C{50pt}C{50pt}}
        & \rotatebox[origin=c]{0}{ID} & \rotatebox[origin=c]{0}{ID+T} & \rotatebox[origin=c]{0}{ID+O} & \rotatebox[origin=c]{0}{ID+T+O} \\
\hline
        F1 &  \textit{9e-01} & \textcolor{orangeExperiment}{\underline{2e-24}} &  \textcolor{blueExperiment}{3e-12} &  \textcolor{blueExperiment}{3e-13} \\

        G-mean & \textit{6e-02} &  \textcolor{blueExperiment}{8e-08} &  \textcolor{blueExperiment}{3e-23} &  \textcolor{blueExperiment}{6e-26} \\

        Recall &  \textcolor{blueExperiment}{1e-35} &  \textcolor{blueExperiment}{4e-06} &  \textcolor{orangeExperiment}{\underline{7e-18}} &  \textcolor{orangeExperiment}{\underline{3e-26}}\\

        Precision &  \textcolor{orangeExperiment}{\underline{4e-35}} &  \textcolor{orangeExperiment}{\underline{1e-08}} &  \textcolor{blueExperiment}{2e-20} &  \textcolor{blueExperiment}{3e-26}  \\

        Specificity &  \textcolor{orangeExperiment}{\underline{1e-35}} &  \textcolor{orangeExperiment}{\underline{2e-03}} &  \textcolor{blueExperiment}{3e-22} &  \textcolor{blueExperiment}{4e-26} \\
    \end{tabular}

    \label{tab:f1ups_vs_f1noups_allmonitor}
\end{table}


In the literature, it is common to use FNR@95TNR (False Negative Rate at 95\% True Negative Rate)~\cite{liu2020energy, sun2021react, wang2022vim} as a monitoring evaluation metric. This means that the threshold is set such that 95\% of correct predictions are accepted by the monitor. Here, we compare this standard threshold against the threshold obtained by optimizing g-mean as the effectiveness measure. Table~\ref{tab:95TNR_vs_gmean_allmonitors} shows that threshold optimization is significantly better than the 95\% TNR threshold for the balanced metrics (F1 and g-mean), as well as for Recall, in all four settings. Precision and Specificity are better for 95\% TNR, which is expected by construction. Similar results were obtained when comparing 95\% TNR against OS+F1.
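
As a sketch of the two threshold choices compared here, let $s(x)$ denote the monitor score of an input $x$, and assume (our convention for this illustration, which may differ from that of a given monitor) that an input is rejected when $s(x) > \tau$. Writing $\mathcal{N}$ for the set of correctly predicted samples in the Threshold Optimization set, the 95\% TNR threshold can be expressed as
\[
\tau_{95} = \min \left\{ \tau \; : \; \frac{1}{|\mathcal{N}|} \sum_{x \in \mathcal{N}} \mathbf{1}\left[ s(x) \leq \tau \right] \geq 0.95 \right\},
\]
whereas the optimized threshold is
\[
\tau^{\star} = \mathop{\mathrm{arg\,max}}_{\tau} \sqrt{\mbox{Recall}(\tau) \times \mbox{Specificity}(\tau)}.
\]
In this formulation, $\tau_{95}$ depends only on the scores of the correctly predicted samples, while $\tau^{\star}$ also uses the scores of the samples that should be rejected.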

\begin{table}[H]
    \centering
    \caption{\textbf{Effectiveness measures comparison (@95TNR vs. g-mean)} -- Metrics were computed across the 216 experiments, followed by statistical comparison using the Wilcoxon signed-rank test. The displayed numbers are p-values: underlined orange text indicates that the metric score with the threshold chosen @95TNR is worse than with the threshold optimized with g-mean, regular blue text indicates that the metric score with the threshold chosen @95TNR is better than with the threshold optimized with g-mean, and italicized black text indicates no significant difference.}
    \begin{tabular}{c|C{50pt}C{50pt}C{50pt}C{50pt}}
        & \rotatebox[origin=c]{0}{ID} & \rotatebox[origin=c]{0}{ID+T} & \rotatebox[origin=c]{0}{ID+O} & \rotatebox[origin=c]{0}{ID+T+O} \\
\hline
        F1 &  \textcolor{orangeExperiment}{\underline{6e-16}} & \textcolor{orangeExperiment}{\underline{4e-30}} &  \textcolor{orangeExperiment}{\underline{4e-37}} &  \textcolor{orangeExperiment}{\underline{4e-37}} \\

        G-mean &  \textcolor{orangeExperiment}{\underline{3e-21}} &  \textcolor{orangeExperiment}{\underline{3e-37}} &  \textcolor{orangeExperiment}{\underline{3e-37}} &  \textcolor{orangeExperiment}{\underline{3e-37}} \\

        Recall &  \textcolor{orangeExperiment}{\underline{3e-37}} &  \textcolor{orangeExperiment}{\underline{3e-37}} &  \textcolor{orangeExperiment}{\underline{3e-37}} &  \textcolor{orangeExperiment}{\underline{3e-37}}\\

        Precision &  \textcolor{blueExperiment}{7e-35} &  \textcolor{blueExperiment}{1e-26} &  \textcolor{blueExperiment}{1e-25} &  \textcolor{blueExperiment}{1e-22}  \\

        Specificity &  \textcolor{blueExperiment}{3e-37} &  \textcolor{blueExperiment}{3e-37} &  \textcolor{blueExperiment}{3e-37} &  \textcolor{blueExperiment}{3e-37} \\
    \end{tabular}

    \label{tab:95TNR_vs_gmean_allmonitors}
\end{table}

\section{Further comparisons of optimization set construction approaches}
\label{app:strategies}

In this section, we present additional performance comparisons of the four approaches for constructing the Threshold Optimization set, using the remaining evaluation metrics (Recall, Precision, and Specificity). The results obtained with OS+F1 as the effectiveness measure are given in Figure~\ref{fig:nemenyi_allmonitors_extraoptf1}, and the results obtained with g-mean as the effectiveness measure are given in Figure~\ref{fig:nemenyi_allmonitors_extraoptmean}.

\begin{figure}[htbp]
\centering

\subfloat[Recall (effectiveness measure: OS+F1)]{\label{sfig:4a}\includegraphics[width=.47\textwidth]{img/CDdiagramnemenyi_recall-score_evaluation_optimizef1.png}}\hfill
\subfloat[Precision (effectiveness measure: OS+F1)]{\label{sfig:4b}\includegraphics[width=.47\textwidth]{img/CDdiagramnemenyi_precision-score_evaluation_optimizef1.png}}\hfill

\subfloat[Specificity (effectiveness measure: OS+F1)]{\label{sfig:4c}\includegraphics[width=.47\textwidth]{img/CDdiagramnemenyi_specificity-score_evaluation_optimizef1.png}}\hfill
  \caption{\textbf{Threshold Optimization sets comparison, with OS+F1 as the effectiveness measure} -- Critical distance diagram showing the results of the Nemenyi test. The horizontal axis represents the average rank of the approaches. A black bar connecting two or more approaches indicates no significant difference.}\label{fig:nemenyi_allmonitors_extraoptf1}

\end{figure}

\begin{figure}[H]
\centering

\subfloat[Recall (effectiveness measure: g-mean)]{\label{sfig:5a}\includegraphics[width=.47\textwidth]{img/CDdiagramnemenyi_recall-score_evaluation_optimizegmean.png}}\hfill
\subfloat[Precision (effectiveness measure: g-mean)]{\label{sfig:5b}\includegraphics[width=.47\textwidth]{img/CDdiagramnemenyi_precision-score_evaluation_optimizegmean.png}}\hfill

\subfloat[Specificity (effectiveness measure: g-mean)]{\label{sfig:5c}\includegraphics[width=.47\textwidth]{img/CDdiagramnemenyi_specificity-score_evaluation_optimizegmean.png}}\hfill
  \caption{\textbf{Threshold Optimization sets comparison, with g-mean as the effectiveness measure} -- Critical distance diagram showing the results of the Nemenyi test. The horizontal axis represents the average rank of the approaches. A black bar connecting two or more approaches indicates no significant difference.}\label{fig:nemenyi_allmonitors_extraoptmean}

\end{figure}

\newpage
\section{Additional information on the example discussed in Section~\ref{sec:Discussion}}
\label{app:extreme}

Section~\ref{sec:Discussion} discusses a qualitative example to illustrate and better understand the results of our experimental analysis. For clear visualization, we selected a scenario that exhibits good separability (AUROC $>$ 0.8 on the Evaluation set) and that shows the maximum performance variability among the different approaches. The selected scenario consists of the Mahalanobis monitor, used with the ResNet NN on the CIFAR10 ID dataset, with the FGSM attack as the threat.

Tables~\ref{tab:extreme_add_info1} and \ref{tab:extreme_add_info2} show additional information about this example. More specifically, we report the values of the five evaluation metrics on the Threshold Evaluation set, as well as the AUROC score, for each threshold optimization approach (ID, ID+T, ID+O) and each effectiveness measure.

\begin{table}[H]
    \centering
    \caption{\textbf{Monitoring performance for the selected qualitative example, with OS+F1 as the effectiveness measure} -- Metric scores measured on the Threshold Evaluation set for the different approaches.}

    \begin{tabular}{lllllll}
    \hline
        Approach & F1 & g-mean & Recall & Precision & Specificity & AUROC\\ \hline
        ID & 0.587 & 0.730 & 0.971 & 0.421 & 0.549 & 0.848\\ \hline
        ID+T & 0.636 & 0.787 & 0.909 & 0.490 & 0.681 & 0.848\\ \hline
        ID+O & 0.629 & 0.780 & 0.937 & 0.473 & 0.649 & 0.848\\ \hline
    \end{tabular}
    \label{tab:extreme_add_info1}
\end{table}
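
As a consistency check, the F1 and g-mean values in Table~\ref{tab:extreme_add_info1} can be recovered from the other columns using the standard definitions (g-mean as the geometric mean of Recall and Specificity, F1 as the harmonic mean of Precision and Recall). For the ID row:
\[
\sqrt{0.971 \times 0.549} \approx 0.730, \qquad
\frac{2 \times 0.421 \times 0.971}{0.421 + 0.971} \approx 0.587.
\]
Small discrepancies in the last digit can occur for other rows, since the tabulated Recall, Precision, and Specificity values are themselves rounded.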

\begin{table}[H]
    \centering
    \caption{\textbf{Monitoring performance for the selected qualitative example, with g-mean as the effectiveness measure} -- Metric scores measured on the Threshold Evaluation set for the different approaches.}
    \begin{tabular}{lllllll}
    \hline
        Approach & F1 & g-mean & Recall & Precision & Specificity & AUROC\\ \hline
        ID & 0.636 & 0.787 & 0.909 & 0.490 & 0.681 & 0.848\\ \hline
        ID+T & 0.643 & 0.791 & 0.879 & 0.507 & 0.713 & 0.848\\ \hline
        ID+O & 0.589 & 0.719 & 0.613 & 0.568 & 0.843 & 0.848\\ \hline
    \end{tabular}


    \label{tab:extreme_add_info2}
\end{table}
