\section{Experimental results}
\label{sec:exp}
\subsection{Experimental Setup}

We compare the convergence of several rare event simulation methods: our Adversarial-Attack Driven IS of Sect.~\ref{sec:ProposedMethod} (abbreviated ADV-IS), the Line Sampling (LS) estimator~\eqref{eq:LineSampling}, the Cross-Entropy Importance Sampling (CE-IS) estimator~\eqref{eq:ImpSamp},~\eqref{eq:CE}, and two estimators based on Sequential Monte Carlo (SMC) techniques: Multilevel Splitting \citep{beck_mls} and Langevin Monte Carlo within an SMC scheme \citep{titaistats}, which we denote MLS-SMC and MALA-SMC respectively (MALA stands for Metropolis-adjusted Langevin algorithm). An important parameter of these SMC methods, in addition to the number of samples $N$, is the number $T$ of applications of a transition kernel, which
%makes it possible to approximate a non-normalized distribution and
reduces the dependence between samples. Theoretical guarantees are derived under perfect independence ($T=\infty$). In practice, $T$ is finite, and its value strongly affects the number of calls to the NN.
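To make the role of $T$ concrete, a rough call count can be sketched under assumptions that are not stated above: suppose that the SMC scheme visits $K$ intermediate levels (a hypothetical symbol introduced here) and that each kernel application costs one evaluation of the network per particle. Writing $N_{\text{calls}}$ for the number of calls to the NN, this gives
\[
N_{\text{calls}} \approx N + K\,T\,N = N\,(1+KT),
\]
so that, for fixed $N$ and $K$, the cost grows linearly with $T$. The exact count depends on the implementation (for instance, gradient evaluations in MALA-SMC or an adaptive choice of the levels), so this expression should be read as an order-of-magnitude guide only.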

We consider four models across three datasets and apply uniform noise to different instances. For each instance, we compute a reference probability of failure $\pfest^{\text{Ref}}$ with an expensive IS computation (taking $N$ of the order of $10^6$), and we check a posteriori that all methods converge towards the same value. In addition to benchmarking the rare event simulation methods, we compute both the FORM estimate $\pfail^{\FORM}$ and, whenever possible, the SORM estimate $\pfail^{\SORM}$, as defined above, using different search methods. The estimators are compared quantitatively using two metrics:
\begin{itemize}
    \item The coefficient of variation $\Delta[\cdot]$, defined for an estimator $\pfest$ as $\Delta[\pfest]=\frac{\sqrt{\V[\pfest]}}{\E[\pfest]}$.
    \item The relative mean absolute error, denoted $\mathrm{RE}[\cdot]$ and defined as $\mathrm{RE}[\pfest]=\E[|\pfail-\pfest|]\cdot \pfail^{-1}$.
\end{itemize}
In practice, we have to estimate these metrics by their empirical counterparts. Moreover, as $\Rerror$ explicitly involves the failure probability, we use the reference probability $\pfest^{\text{Ref}}$ as a surrogate. Crucially, for a fair comparison, these metrics and the complexity of an estimator (gauged by the number of calls to the model) are measured over the same runs.
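For concreteness, one way to write these empirical counterparts, assuming $R$ independent runs producing the estimates $\pfest^{(1)},\dots,\pfest^{(R)}$ (the symbols $R$ and $\hat{\mu}$ and the $1/(R-1)$ normalisation are notational assumptions introduced here for illustration), is
\[
\hat{\mu}=\frac{1}{R}\sum_{r=1}^{R}\pfest^{(r)},\qquad
\hat{\Delta}[\pfest]=\frac{1}{\hat{\mu}}\sqrt{\frac{1}{R-1}\sum_{r=1}^{R}\bigl(\pfest^{(r)}-\hat{\mu}\bigr)^{2}},\qquad
\widehat{\mathrm{RE}}[\pfest]=\frac{1}{R\,\pfest^{\text{Ref}}}\sum_{r=1}^{R}\bigl|\pfest^{\text{Ref}}-\pfest^{(r)}\bigr|.
\]
The last expression makes the surrogate explicit: the unknown failure probability is replaced by $\pfest^{\text{Ref}}$ both inside the absolute value and in the normalisation.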
All experiments were run on a personal laptop with an RTX 4060 GPU.
All the code will be made publicly available on GitHub once the review process is over.

\subsection{MNIST}
\subsubsection{MLP with two hidden layers}
We first compare these methods on a simple Multi-Layer Perceptron (MLP) with only two hidden layers (each containing 200 neurons) trained on the MNIST dataset \citep{mnist}, referred to as model $\model{1}$, and on a first instance denoted $\inputs{1}$. We consider an additive noise perturbation, uniform on the $\ell_{\infty}$ ball of radius $\varepsilon=0.18$ centered on $\inputs{1}$; see Figure~\ref{fig:example_1}. This distribution can be mapped to the standard Gaussian law via the isoprobabilistic transform mentioned in Sect.~\ref{sec:Uspace}. At this level of noise, the probability of misclassification is low. Running an expensive simulation, we find that $\pfest^{\text{Ref}} \approx 1.95 \cdot 10^{-6}$.
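As an illustration of such a transform, and assuming the componentwise construction that is standard for uniform noise (the map actually used is the one of Sect.~\ref{sec:Uspace}; the symbols $u$, $x$ and $\Phi$ below are illustrative), a standard Gaussian vector $u\sim\mathcal{N}(0,I_d)$ can be mapped to a perturbed input $x$ by
\[
x_i = (\inputs{1})_i + \varepsilon\,\bigl(2\,\Phi(u_i)-1\bigr),\qquad i=1,\dots,d,
\]
where $\Phi$ denotes the standard normal cumulative distribution function. Since $\Phi(u_i)$ is uniform on $[0,1]$, each coordinate of the perturbation is uniform on $[-\varepsilon,\varepsilon]$, and $u=0$ is mapped to the clean input.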

\begin{figure}[tb]
  \centering
  \includegraphics[width=1.\linewidth]{figures/MNIST/examples_mnist/examples_noisy_img_idx_0_eps_0.18.pdf}
  \caption{Input $\inputs{1}$ (on the left) and examples of perturbations with uniform noise $\varepsilon=0.18$.} 
  \label{fig:example_1}
\end{figure}

We apply the FORM and SORM methods with three adversarial attacks: the Carlini-Wagner (CW) attack, the FMNA attack, and the HLRF attack. SORM is applicable here because the dimension is $d=784$ for this dataset, so it is possible to manipulate matrices of size $d\times d$ and, in particular, to evaluate the Hessian of $G$ via automatic differentiation. Table~\ref{tab:form_1_1} presents the results. FORM overestimates the probability of failure by a factor of roughly 40 to 60 when the design point~\eqref{eq:DesignPoint} is searched with the FMNA and HLRF attacks, but underestimates it by a factor of roughly 27 with the CW attack. This indicates that the decision boundary at $u^\star$ is not ``flat'' enough for a linear approximation to hold. This idea is further reinforced by observing that the SORM estimates are closer to the reference probability of failure, within a factor of about 3.5. In addition, we note that the CW attack performed poorly here, as the norm of the point it returns is higher than that of the two other attacks. Moreover, the Hessian $\nabla^2 h=-\nabla^2 G$ has both positive and negative eigenvalues at the CW point, whereas it only has non-positive eigenvalues at the other attack points.
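The FORM column of Table~\ref{tab:form_1_1} can be cross-checked from the reported norms, assuming the usual FORM expression in the standard Gaussian space, with $\Phi$ the standard normal cumulative distribution function (notation assumed here):
\[
\pfail^{\text{FORM}} = \Phi\bigl(-\|\tilde{u}^*\|_2\bigr),\qquad
\Phi(-3.68)\approx 1.17\cdot 10^{-4},\quad
\Phi(-3.79)\approx 7.5\cdot 10^{-5},\quad
\Phi(-5.26)\approx 7.2\cdot 10^{-8}.
\]
These values agree with the FMNA, HLRF and CW rows respectively. They also illustrate how sensitive the estimate is to the norm of the point returned by the attack: the larger norm found by CW accounts on its own for the three orders of magnitude between its FORM estimate and the other two.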

\begin{figure}[b]
  \centering
  \includegraphics[width=1.\linewidth]{figures/form_sorm/attack_examples_model_release_img_idx_0.pdf}
  \caption{Adversarial attacks for model $\model{1}$ on input $\inputs{1}$. } 
  \label{fig:form_sorm_1}
\end{figure}

\begin{table}
    \centering
    %\resizebox{\columnwidth}
    {
    \caption{\label{tab:form_1_1} FORM/SORM estimations of $\pfest^{\text{Ref}}\approx  1.95\cdot 10^{-6}$ for model $\model{1}$ and input $\inputs{1}$, with uniform noise ($\varepsilon=0.18$).}
    \begin{tabular}{|c|c|c|c|}
        \hline
        Attack & $\pfail^{\text{FORM}}$ & $\pfail^{\text{SORM}}$ & $\cos(\tilde{u}^*,\nabla G(\tilde{u}^*))$  \\
        \hline
        CW & $7.2\cdot 10^{-8}$ & $6.39\cdot 10^{-6}$& $ -0.69$  \\
        FMNA & $1.17\cdot 10^{-4}$ & $6.49\cdot 10^{-6}$& $ -0.995$  \\
        HLRF & $7.53\cdot 10^{-5}$ & $6.65\cdot 10^{-6}$& $ -0.977$  \\
        \hline
        & $\|\tilde{u}^*\|_2$ & $G(\tilde{u}^*)$ & Time (in sec.)  \\
        \hline
        CW & $5.26$ & $-4.1\cdot 10^{-5}$& $ 0.19 $  \\
        FMNA & $3.68$ & $-1.4\cdot 10^{-5}$& $ 0.16$  \\
        HLRF & $3.79$ & $-2.0\cdot 10^{-2}$& $ 0.01$  \\
        \hline
    \end{tabular}%
    }
\end{table}

\begin{figure}[!htb]
  \centering
  \includegraphics[width=1.05\linewidth]{figures/form_sorm/eigvals_model_mnist_model_FMNA_idx_0.pdf}
  \caption{Eigenvalues of the Hessian of $h$ at the CW attack (on the left), at the FMNA attack (in the center), and the HLRF attack (on the right).} 
  \label{fig:eigvals_1}
\end{figure}

We next look at the convergence of the statistical methods with respect to the average number of calls, denoted $\bar{N}_{\text{calls}}$. In Figure~\ref{fig:convergence_1}, we see that all methods seem to converge towards the reference probability as the average number of calls increases, though their convergence rates differ. In particular, the Sequential Monte Carlo methods, MALA-SMC and MLS-SMC, converge noticeably
more slowly than the LS and ADV-IS methods. The CE-IS method has a significant overhead, as it must first converge towards a good parameter $\boldsymbol{\theta}$ before exploiting its final distribution to compute an estimate of $\pfail$. We focus on the IS and LS methods in Figure~\ref{fig:convergence_2}, comparing their speed of convergence for different adversarial attacks.
These figures are obtained by running each method 400 times (with different random seeds, to obtain standard errors) using a given number of samples $N$, and repeating the same operation for increasing values of $N$. For example, we ran ADV-IS for $N\in\{100,1000,10000,50000,100000\}$.


\begin{figure}[!htb]
  \centering
  \includegraphics[width=\linewidth]{figures/MNIST/comparisons/comp_methods_model_release_noise_dist_uniform_input_index_0_1.pdf}
  \caption{Convergence of different estimators w.r.t. the number of calls to the model $\model{1}$.} 
  \label{fig:convergence_1}
\end{figure}

\begin{figure}[!htb]
  \centering
  \includegraphics[width=\linewidth]{figures/MNIST/comparisons/comp_methods_model_release_noise_dist_uniform_input_index_0_2.pdf}
  \caption{Convergence of IS and LS with different attacks.} 
  \label{fig:convergence_2}
\end{figure}

Finally, we report the best performance of each algorithm (with respect to the number of samples used) in terms of the squared coefficient of variation multiplied by a measure of the computational burden. In practice, we use either the average number of calls to the model $\bar{N}_{\text{calls}}$ (i.e. the metric $\hat{\Delta}^2[\pfest]\times \bar{N}_{\text{calls}}$), or the duration of the simulation in seconds (i.e. the metric $\hat{\Delta}^2[\pfest]\times \text{time}$).
Table~\ref{tab:best_perf} reports the results, where $N_{\text{best}}$ denotes the number of samples that gave the best performance in terms of the metric $\hat{\Delta}^2[\pfest]\times \bar{N}_{\text{calls}}$. All metrics reported in this table pertain to this choice of $N$. The ADV-IS method outperforms all other methods on both metrics mentioned above. The CE-IS method also performs well for a relatively low number of samples $N_{\text{best}}$ used for estimation. However, the \textit{total} number of calls needed for CE-IS is of the order of \textit{hundreds of thousands}.
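As a reading aid for Table~\ref{tab:best_perf}, the squared coefficient of variation can be recovered from either work-normalised column by dividing by the corresponding cost. For the ADV-IS row, for instance,
\[
\hat{\Delta}^2[\pfest]\approx\frac{48}{5\cdot 10^{4}} = 9.6\cdot 10^{-4}
\qquad\text{and}\qquad
\hat{\Delta}^2[\pfest]\approx\frac{4.8\cdot 10^{-5}}{5\cdot 10^{-2}} = 9.6\cdot 10^{-4},
\]
that is, $\hat{\Delta}[\pfest]\approx 3.1\%$. The two columns give the same value, as expected when both costs are measured over the same runs.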

\begin{table}
    \centering
    \caption{\label{tab:best_perf} Best performance of estimators of $\pfail$ for the model $\model{1}$ and input $\inputs{1}$, with uniform noise ($\varepsilon=0.18$).}
    \begin{tabular}{|c|c|c|c|}
        \hline
        Method & $N_{\text{best}}$ & time (sec.)& $\Rerror[\pfest]$    \\
        \hline
        ADV-IS & $5\cdot 10^{4}$ & $5\cdot 10^{-2}$& $2.5\cdot 10^{-2}$ \\
        CE-IS & $3\cdot 10^{4}$ & $2.3\cdot 10^{-1}$&$4.3\cdot 10^{-2}$ \\
        LS & $50$ & $4.3\cdot 10^{-2}$& $2.1\cdot 10^{-1}$ \\
        MALA & $256$ & $2.0\cdot 10^{-1}$& $2.1\cdot 10^{-1}$  \\
        MLS & $1024$ & $2.5\cdot 10^{-2}$& $2.6\cdot 10^{-1}$  \\
        \hline
        & $\hat{\Delta}^2[\pfest]\times \bar{N}_{\text{calls}}$  & $\hat{\Delta}^2[\pfest]\times \text{time}$ &   $\bar{N}_{\text{calls}}$ \\
        \hline
        ADV-IS & $48$ &  $4.8\cdot 10^{-5}$  &$5\cdot 10^{4}$ \\
        CE-IS & $460$ & $7\cdot 10^{-4}$& $1.5\cdot 10^{5}$  \\
        LS & $77$ & $2.9\cdot 10^{-3}$& $ 1200$  \\
        MALA & $3000$ & $1.5\cdot 10^{-2}$& $4\cdot 10^{4}$  \\
        MLS & $6200$ & $2.7\cdot 10^{-3}$& $5.7\cdot 10^{4}$  \\
        \hline
    \end{tabular}
\end{table}
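
To put the work-normalised variance in perspective, it can be compared with crude Monte Carlo, which is not part of the benchmark above and is used here only as a textbook baseline. For $N$ i.i.d. samples, the indicator-based estimator satisfies $\Delta^2=(1-\pfail)/(N\,\pfail)$ at a cost of one call per sample, so that, taking $\pfail\approx\pfest^{\text{Ref}}$,
\[
\Delta^2\times N_{\text{calls}} = \frac{1-\pfail}{\pfail}\approx\frac{1}{1.95\cdot 10^{-6}}\approx 5.1\cdot 10^{5}.
\]
Against this yardstick, the value of $48$ reported for ADV-IS in Table~\ref{tab:best_perf} corresponds to a reduction of roughly four orders of magnitude in the number of calls needed for a given relative accuracy.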


\subsubsection{MLP with four hidden layers}
We now consider a similar MLP architecture with four hidden layers (each containing 200 neurons), denoted $\model{2}$. Simulation results for the FORM and SORM algorithms are given in the Appendix. Overall, these results support the idea that the decision boundaries of neural networks are not (locally) flat enough to be accurately approximated by hyperplanes, as the FORM method tends to overestimate the probability by a factor of 10 or more. In contrast, the SORM method shows promising results, with the caveat that it systematically underestimates the probability of failure, which can be problematic in safety-critical applications.
Focusing now on statistical estimators, we study their empirical convergence for two images, $\inputs{2}$ and $\inputs{3}$, with perturbations similar to those of the previous section, i.e. uniform noise on $\ell_{\infty}$ balls of radius $\varepsilon=0.18$. Simulation results are reported in Figures~\ref{fig:converge_2_1} and~\ref{fig:converge_2_2}.

\begin{figure}[b]%[!htb]
  \centering
  \includegraphics[width=1.\linewidth]{figures/MNIST/comparisons/comp_methods_model_dnn4_mnist_noise_dist_uniform_input_index_1.pdf}
  \caption{Convergence of different estimators w.r.t. the number of calls to the model $\model{2}$, on the input $\inputs{2}$.} 
  \label{fig:converge_2_1}
\end{figure}

\begin{figure}[!htb]
  \centering
  \includegraphics[width=1.\linewidth]{figures/MNIST/comparisons/comp_methods_model_dnn4_mnist_noise_dist_uniform_input_index_2.pdf}
  \caption{Convergence of different estimators w.r.t. the number of calls to the model $\model{2}$, on the input $\inputs{3}$.} 
  \label{fig:converge_2_2}
\end{figure}

As in the previous experiments, the SMC-based algorithms converge much more slowly than both LS and the adversarial-attack-driven IS algorithm, though the gap is slightly smaller for input $\inputs{3}$, which has a higher probability of failure; in particular, the MLS algorithm underestimates less dramatically when a smaller number of samples is used. Interestingly, in this example, the MLS algorithm, which is a black-box method, seems to slightly outperform the MALA-SMC algorithm, which uses gradient information \citep{titaistats}.

%\begin{figure}[!htb]
 % \centering
  %\includegraphics[width=1.\linewidth]{figures/form_sorm/%attack_examples_model_dnn4_img_idx_1.pdf}
  %\caption{Estimation of $\pfail  1.7\cdot 10^{-8}$ with %FORM and SORM using different Adversarial attacks, on 2nd %input. } 
 % \label{fig:form_sorm_2}
%\end{figure}



\subsection{CIFAR10}

We move on to the CIFAR10 dataset, which is more challenging for rare event simulation as the dimension of each input is $d=32^2\times 3=3072$. We run experiments on a custom convolutional neural network, which consists of four convolutional layers followed by two dense layers and contains a total of $476\,278$ scalar parameters.
\begin{figure}[b]
  \centering
  \includegraphics[width=1.\linewidth]{figures/CIFAR/examples_cifar10/examples_noisy_img_idx_5_sigma_0.02.pdf}
  \caption{Clean input of the CIFAR10 dataset (on the left) and copies perturbed with Gaussian noise ($\sigma=0.02$).} 
  \label{fig:eigen_1}
\end{figure}


As before, we applied the FORM algorithm using different adversarial attacks, and the associated results are reported in Table~\ref{tab:cnn_form}. However, it is not possible to apply the SORM algorithm, as it requires too much memory capacity and computing power.



\begin{table}
    \centering
    \caption{\label{tab:cnn_form} FORM/SORM estimations of $\pfail\approx  2.4\cdot 10^{-7}$ for the custom CNN model, with uniform noise ($\varepsilon=0.03$).}
    \begin{tabular}{|c|c|c|c|}
        \hline
        Attack & $\pfail^{\text{FORM}}$ & $\pfail^{\text{SORM}}$ & $\cos(\tilde{u}^*,\nabla G(\tilde{u}^*))$  \\
        \hline
        CW & $3.91\cdot 10^{-5}$ & NA& $ -0.97$  \\
        FMNA & $5.22\cdot 10^{-5}$ & NA& $ -0.985$  \\
        HLRF & $2.16\cdot 10^{-5}$ & NA& $ -0.965$  \\
        \hline
        & $\|\tilde{u}^*\|_2$ & $G(\tilde{u}^*)$ & Time (in sec.)  \\
        \hline
        CW & $3.95$ & $-1.2\cdot 10^{-4}$& $ 1.49 $  \\
        FMNA & $3.88$ & $-8.0\cdot 10^{-5}$& $ 0.23$  \\
        HLRF & $4.09$ & $-8.1\cdot 10^{-2}$& $ 0.03$  \\
        \hline
    \end{tabular}
\end{table}

We next focus on the performance of the simulation algorithms. Again, we primarily compare the LS and adversarial-attack-driven IS algorithms to the sequential Monte Carlo methods used in the literature \citep{webb2018statistical,titaistats}. The associated results are reported in Figure~\ref{fig:cnn_conv_1}.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=1.\linewidth]{figures/CIFAR/comp_methods_model_cnn_custom_cifar10_noise_dist_uniform_input_index_5_1.pdf}
  \caption{Convergence of different estimators w.r.t. the number of calls to the CNN.} 
  \label{fig:cnn_conv_1}
\end{figure}

We obtain results similar to those obtained on MNIST: our method and Line Sampling converge in a few thousand calls, whereas state-of-the-art SMC algorithms require a few \textit{hundred} thousand calls to obtain similar standard errors. That being said, the performance gap is somewhat smaller, a fact we attribute to the curse of dimensionality (COD), which leads to weight degeneracy in Importance Sampling \citep{cod_is}.
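Weight degeneracy can be quantified with the effective sample size, a standard diagnostic that is not reported in the experiments above and is recalled here only for illustration. Writing $w_i\geq 0$ for the unnormalised importance weight of the $i$-th of the $N$ samples (hypothetical notation), and provided at least one weight is non-zero,
\[
\mathrm{ESS}=\frac{\bigl(\sum_{i=1}^{N} w_i\bigr)^{2}}{\sum_{i=1}^{N} w_i^{2}},\qquad 1\leq\mathrm{ESS}\leq N.
\]
The upper bound is attained when all weights are equal, and the lower bound when a single weight dominates the sum, which is the regime associated with the curse of dimensionality.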

Figure~\ref{fig:cnn_conv_2} compares the performance of the adversarial attacks.
We again observe very small differences in performance between the FMNA and HLRF algorithms. This suggests that the HLRF algorithm we have implemented for neural networks is a powerful adversarial attack.%, though more research is needed.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=1.\linewidth]{figures/CIFAR/comp_methods_model_cnn_custom_cifar10_noise_dist_uniform_input_index_5_2.pdf}
  \caption{Convergence of the estimators with different adversarial attacks w.r.t. the number of calls to the CNN.} 
  \label{fig:cnn_conv_2}
\end{figure}



\subsection{ImageNet Results}

Finally, we conclude this section with experimental results obtained on the ImageNet dataset \citep{imagenet}, where $d=224^2\times 3 = 150\,528$. We test the probabilistic robustness of a pre-trained ResNet-18 model \citep{resnet} under uniform noise of size $\varepsilon = 0.055$ around a clean image. Figure~\ref{fig:resnet18} illustrates the convergence of the ADV-IS, MALA-SMC, and MLS-SMC estimators. In contrast to the previous experiments, the convergence rate of ADV-IS is worse than that of the SMC-based methods. We attribute this poor performance to the high dimension of the problem, which leads to catastrophic weight degeneracy, as mentioned above. In this case, SMC methods appear to be more reliable than the proposed adversarial-attack-based Importance Sampling. Thus, proposing a method that is both highly efficient for moderately high-dimensional data and reliable even for very high-dimensional data remains an important direction for future research in probabilistic robustness assessment.

\begin{figure}[!htb]
  \centering
  \includegraphics[width=1.\linewidth]{figures/ImageNet/comp_methods_model_resnet18_imagenet_noise_dist_uniform_input_index_12.pdf}
  \caption{Convergence of different estimators w.r.t. the number of calls to the ResNet-18.} 
  \label{fig:resnet18}
\end{figure}

