\section{Empirical Evaluation}
\label{sec:exp}

%fig1 -- method properly approximates p_empirical
%fig1a: method works
%02d_pemp_vs_pothers_over_sigma/rel/fmnist_cnn.png
%fig1b: method works better for robust models
%02e_pemp_vs_pmmse_over_sigma_robust_models/rel/cifar10_resnet18.png
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.6\textwidth}
        \centering
        % \includegraphics[width=\textwidth]{figures/fig1a_fmnist_cnn.pdf}
        \includegraphics[width=\textwidth]{figures/appendix/d_accuracy_of_estimators/cifar10_resnet18.pdf}
        \captionsetup{justification=centering}
        \caption{CIFAR10, ResNet18 \\ Comparing estimator errors}
        \label{fig1a:method-works-over-sigma}
    \end{subfigure}
    \hspace{1em}
    \begin{subfigure}{0.35\textwidth}
        \centering
        \includegraphics[width=\textwidth]{figures/fig1b_cifar10_resnet18.pdf}
        \captionsetup{justification=centering}
        \caption{CIFAR10, ResNet18 \\ Varying model robustness}
        \label{fig1b:method-works-robust}
    \end{subfigure}
    \caption{Empirical evaluation of analytical estimators. (a) The smaller the noise neighborhood $\sigma$, the more accurately the estimators compute \probust{}. \pmmse{} and \pmmsemvs{} are the best estimators of \probust{}, followed closely by \ptaylormvs{} and \ptaylor{}, trailed by \psoftmax{}. (b) For more robust models, the estimators compute \probust{} more accurately over a larger $\sigma$. Together, these results indicate that the analytical estimators accurately compute \probust{}.}
    \label{fig1:method-works}
\end{figure*}

In this section, we first evaluate the estimation errors and computational efficiency of the analytical estimators, and then the impact of robust training on these estimation errors. Next, we analyze the relationship between average-case robustness and softmax probability. Lastly, we demonstrate the usefulness of local robustness for model and dataset understanding with two case studies. Key results are discussed in this section; full results are in Appendix~\ref{app:experiments}.


\textbf{Datasets and models.}
We evaluate the estimators on four datasets: MNIST \citep{deng2012mnist}, FashionMNIST \citep{xiao2017fashion}, CIFAR10 \citep{krizhevsky2009learning}, and CIFAR100 \citep{krizhevsky2009learning}. For MNIST and FashionMNIST, we train linear models and CNNs. For CIFAR10 and CIFAR100, we train ResNet18 models \citep{he2016deep}. We also train ResNet18 models using varying levels of gradient norm regularization~\cite{srinivas2018knowledge, srinivas2024models} to obtain models with varying levels of robustness.
For gradient norm regularization, the objective function is $\ell(f(x), y) + \lambda \|\nabla_x f(x)\|_2^2$, where $\lambda$ is the regularization constant. The larger $\lambda$ is, the more robust the model.
Note that gradient norm regularization is equivalent to Gaussian data augmentation with an infinite number of augmented samples~\cite{srinivas2018knowledge} and is different from adversarial training.
Unless otherwise noted, the experiments below use each dataset's test set, which consists of 10,000 points. Additional details about the datasets and models are in Appendices~\ref{app:datasets} and~\ref{app:models}.



\subsection{Evaluation of the estimation errors of analytical estimators}
\label{sec:exp_correctness}



\textbf{The analytical estimators accurately compute local robustness.}
To empirically evaluate the estimation error of our estimators, we calculate \probust{} for each model using \pmc{}, \ptaylor{}, \pmmse{}, \ptaylormvs{}, \pmmsemvs{}, and \psoftmax{} for different $\sigma$ values. For \pmc{}, \pmmse{}, and \pmmsemvs{}, we use a sample size at which these estimators have converged ($n = 10000$, $5$, and $5$, respectively; convergence analyses are in Appendix~\ref{app:experiments}). We take the Monte-Carlo estimator \pmc{} as the gold-standard estimate of $p^{robust}_{\sigma}$, and compute the absolute and relative differences between \pmc{} and the other estimators to evaluate their estimation errors.
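
For concreteness, the comparison can be written out as follows. This is a sketch under assumed notation rather than a definition taken from Section~\ref{sec:methods}: $t$ denotes the class predicted at the clean input $x$, $f_i$ the model output for class $i$, and $\hat{p}_{\sigma}$ any one of the analytical estimators; the normalization of the relative error by $p^{mc}_{\sigma}$ is likewise our assumption.
\begin{align*}
    p^{mc}_{\sigma}(x) &= \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\!\left[\arg\max_{i} f_i(x + \epsilon_j) = t\right], \qquad \epsilon_j \sim \mathcal{N}(0, \sigma^2 I), \\
    \mathrm{err}_{abs}(x) &= \left| p^{mc}_{\sigma}(x) - \hat{p}_{\sigma}(x) \right|, \qquad
    \mathrm{err}_{rel}(x) = \frac{\left| p^{mc}_{\sigma}(x) - \hat{p}_{\sigma}(x) \right|}{p^{mc}_{\sigma}(x)}.
\end{align*}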

%pmmse family = best estimator
The performance of the estimators for the CIFAR10 ResNet18 model is shown in Figure~\ref{fig1a:method-works-over-sigma}. The results indicate that \pmmsemvs{} and \pmmse{} are the best estimators of \probust{}, followed closely by \ptaylormvs{} and \ptaylor{}, trailed by \psoftmax{}. This is consistent with the theory in Section~\ref{sec:methods}, where the analytical estimation errors of $p^{mmse}_{\sigma}$ are lower than those of $p^{taylor}_{\sigma}$.

%smaller noise neighborhood, better approximation
The results also confirm that the smaller the noise neighborhood $\sigma$, the more accurately the estimators compute \probust{}. For the MMSE and Taylor estimators, this is because their linear approximation of the model around the input is more faithful for smaller $\sigma$. As expected, when the model is linear, \ptaylor{} and \pmmse{} accurately compute \probust{} for all $\sigma$'s (Appendix~\ref{app:experiments}). For the softmax estimator, \psoftmax{} values are constant over $\sigma$ and this particular model has high \psoftmax{} values for most points. Thus, for small $\sigma$'s where \probust{} is near one, \psoftmax{} happens to approximate \probust{} for this model. Examples of images with varying levels of noise ($\sigma$) are in Appendix~\ref{app:experiments}.
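
The dependence on $\sigma$ can be made explicit with a short worked expansion. This is a sketch, assuming that $g_i$ is differentiable at $\X$ and denotes the per-class quantity that enters the estimators as in Proposition~\ref{eqn:taylor-estimator}, and that the noise is $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ in $d$ input dimensions (the symbols $R_i$ and $d$ are introduced here for illustration):
\begin{align*}
    g_i(\X + \epsilon) &= g_i(\X) + \grad g_i(\X)^\top \epsilon + R_i(\epsilon), \\
    g_i(\X) + \grad g_i(\X)^\top \epsilon &\sim \mathcal{N}\!\left(g_i(\X),\ \sigma^2 \|\grad g_i(\X)\|_2^2\right).
\end{align*}
The linear part is exactly Gaussian, and standardizing it yields the ratio $g_i(\X) / (\sigma \|\grad g_i(\X)\|_2)$ that appears in Proposition~\ref{eqn:taylor-estimator}. The remainder $R_i(\epsilon)$ is second order in $\epsilon$ when $g_i$ is twice differentiable, and the expected squared noise norm is $\sigma^2 d$, so the neglected term shrinks as $\sigma$ decreases and is identically zero for linear models.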

\textbf{Impact of robust training on estimation errors.} 
The performance of \pmmse{} for CIFAR10 ResNet18 models of varying levels of robustness is shown in Figure~\ref{fig1b:method-works-robust}. The results indicate that the estimator remains accurate over a larger range of $\sigma$ for more robust models (larger $\lambda$). This is because robust training leads to models that are more locally linear \cite{moosavi2019robustness}, so the estimator's linear approximation of the model around the input, and hence its estimate of \probust{}, stays accurate over a larger range of $\sigma$.


\textbf{Evaluating estimation error of mv-sigmoid.} To examine \emph{mv-sigmoid}'s approximation of \emph{mvn-cdf}, we compute both functions using the same inputs ($z~=~ \frac{g_i(\X)}{\sigma \|\grad g_i(\X)\|_2} \vert_{\substack{i=1\\i\neq t}}^C$, as described in Proposition~\ref{eqn:taylor-estimator}) for the CIFAR10 ResNet18 model for different $\sigma$. The plot of \emph{mv-sigmoid(z)} against \emph{mvn-cdf(z)} for $\sigma=0.05$ is shown in Appendix~\ref{app:experiments} (Figure~\ref{fig2:mvsig-mvncdf}). The results indicate that the two functions are strongly positively correlated with low approximation error, suggesting that \emph{mv-sigmoid} approximates the \emph{mvn-cdf} well in practice.






\subsection{Evaluation of computational efficiency of analytical estimators}


\textbf{The analytical estimators are more efficient than the naïve estimator.}
We examine the efficiency of the estimators by measuring their runtimes when calculating \probustwsigma{0.1} for the CIFAR10 ResNet18 model for 50 points. Runtimes are displayed in Table~\ref{table:runtimes}. The tabulated runtimes imply that \ptaylor{} and \pmmse{} are roughly 600x and 200x faster than \pmc{} on GPU, respectively (roughly 760x and 150x on CPU). Additional runtimes are in Appendix~\ref{app:experiments}.

%table: naive method is inefficient, analytical method is efficient
\begin{table}[ht!]
\centering
\begin{tabular}{l|l|l|l }
    Estimator   & \thead{Number of \\Samples ($n$)}   & \thead{CPU\\Runtime\\(h:m:s)}  & \thead{GPU\\Runtime\\(h:m:s)} \\
    \toprule
    \pmc{}     & $n=10{,}000$ & 1:41:11 & 0:19:56 \\
    \ptaylor{} & N/A          & 0:00:08 & 0:00:02 \\
    \pmmse{}   & $n=5$        & 0:00:41 & 0:00:06 \\
\end{tabular}
\vspace{0.2cm}
\caption{Runtimes of \probust{} estimators. Each estimator computes \probustwsigma{0.1} for the CIFAR10 ResNet18 model for 50 data points. Estimators that use sampling use the minimum number of samples necessary for convergence. Runtimes are in the format of hour:minute:second. The GPU used was a Tesla V100. The analytical estimators (\ptaylor{} and \pmmse{}) are more efficient than the naïve estimator (\pmc{}).} 
\vspace{-0.5cm}
\label{table:runtimes}
\end{table}


We also examine the efficiency of the analytical estimators in terms of compute and memory usage. A backward pass is observed to take about twice as many floating-point operations (FLOPs) as a forward pass~\cite{flops}. In addition, we performed an experiment and found that a forward and backward pass together use about twice the peak memory of a single forward pass. Thus, each iteration of \pmmse{} (which consists of a forward and a backward pass) requires roughly 3x the FLOPs and twice the peak memory of a single iteration of \pmc{} (which consists of one forward pass). However, \pmmse{} requires 5 iterations for convergence while \pmc{} requires about 10,000. Thus, despite its higher per-iteration cost, \pmmse{} is more efficient overall than \pmc{}.
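
As a rough worked example of this accounting, let $c$ denote the cost of one forward pass (a symbol we introduce here) and take the figures above at face value: a backward pass costs about $2c$, \pmmse{} uses 5 iterations, and \pmc{} uses 10,000.
\begin{align*}
    \text{FLOPs of \pmmse{}} &\approx 5 \times (c + 2c) = 15c, \\
    \text{FLOPs of \pmc{}} &\approx 10{,}000 \times c, \\
    \frac{\text{FLOPs of \pmc{}}}{\text{FLOPs of \pmmse{}}} &\approx \frac{10{,}000}{15} \approx 667.
\end{align*}
This back-of-the-envelope ratio ignores batching and hardware effects, so it need not match measured runtime ratios.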



\subsection{Case Studies}
\label{subsec:case-studies}

\textbf{Identifying non-robust data points.} While robustness is typically viewed as a property of a model, the average-case robustness perspective compels us to view robustness as a joint property of both the model and the data point. In light of this, we can ask: given the same model, which samples are robustly classified and which are not? We evaluate whether \probust{} can distinguish such images better than \psoftmax{}. To this end, we train a simple CNN to distinguish between images with high and low \pmmse{}, and a second CNN with the same architecture to distinguish between images with high and low \psoftmax{} (additional setup details are described in Appendix~\ref{app:experiments}). Then, we compare the performance of the two models. For CIFAR10, the test set accuracy for the \pmmse{} CNN is \textbf{92\%} while that for the \psoftmax{} CNN is \textbf{58\%}. These results indicate that \probust{} identifies images that are robust or vulnerable to random noise better than \psoftmax{} does.

We also present visualizations of images with the highest and lowest \pmmse{} in each class for each model. For comparison, we do the same with \psoftmax{}. Example CIFAR10 images are shown in Figure~\ref{fig4:topk-vs-bottomk-main}. We observe that images with low \probust{} tend to have neutral colors, with the object being a similar color to the background (making the prediction likely to change when the image is slightly perturbed), while images with high \probust{} tend to be brightly colored, with the object strongly contrasting with the background (making the prediction likely to stay constant when the image is slightly perturbed). Recall that points with low \probust{} are close to the decision boundary, while those farther away have high \probust{}. Thus, high \probust{} points may be thought of as ``canonical'' examples of the underlying class, while low \probust{} examples are analogous to ``support vectors'' that are critical to model learning. These results showcase the utility of average-case robustness for dataset exploration and analysis, particularly in identifying canonical and outlier examples.


\begin{figure}[htbp!]
    \centering
    \begin{flushleft}
        %row labels
        \hspace{-0.1cm}\rotatebox{90}{\hspace{-9.4cm} \hspace{3cm}Car \hspace{2.9cm}Boat}
        %column labels
        \hspace{1.1cm}Lowest \pmmsewsigma{0.1}
        \hspace{1.5cm} Highest \pmmsewsigma{0.1}
    \end{flushleft}
         
    \begin{subfigure}{0.22\textwidth}
        \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar10_resnet18_p_mmse_sigma0.1_class8_bottomk.pdf}
    \end{subfigure}
    \begin{subfigure}{0.22\textwidth}
        \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar10_resnet18_p_mmse_sigma0.1_class8_topk.pdf}
    % \hspace{0.2cm}
    \end{subfigure}
    \begin{subfigure}{0.22\textwidth}
        \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar10_resnet18_p_mmse_sigma0.1_class1_bottomk.pdf}
    \end{subfigure}
    \begin{subfigure}{0.22\textwidth}
        \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar10_resnet18_p_mmse_sigma0.1_class1_topk.pdf}
    \end{subfigure}
    %2nd batch
    %    \begin{subfigure}{0.22\textwidth}
    %    \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar100_resnet18_p_mmse_sigma0.05_class8_bottomk.pdf}
    %\end{subfigure}
    %\begin{subfigure}{0.22\textwidth}
    %    \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar100_resnet18_p_mmse_sigma0.05_class8_topk.pdf}
    % \hspace{0.2cm}
    %\end{subfigure}
    %\begin{subfigure}{0.22\textwidth}
    %    \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar100_resnet18_p_mmse_sigma0.05_class23_bottomk.pdf}
    %\end{subfigure}
    %\begin{subfigure}{0.22\textwidth}
    %    \includegraphics[width=\linewidth, trim={0.2cm, 0.2cm, 0.2cm, 0.2cm}]{figures/appendix/h_topk_bottomk/cifar100_resnet18_p_mmse_sigma0.05_class23_topk.pdf}
    %\end{subfigure}    
    \caption{Example ranking of \probust{} among CIFAR10 classes. Images with high \probust{} are farther away from the decision boundary, and tend to be brighter and have stronger object-background contrast than those with low \probust{}, which are closer to the decision boundary, and thus easily misclassified.}
    \label{fig4:topk-vs-bottomk-main}
\end{figure}



\begin{figure}[h]
    \centering
    \begin{subfigure}{0.35\textwidth}
        \centering
        \includegraphics[width=\linewidth,  trim={1cm, 0.5cm, 0.8cm, 0cm}]{figures/fig5_cifar10_resnet18_sigma0.09.pdf}
        \caption{CIFAR10, ResNet18}
        \vspace{0.25cm}
    \end{subfigure}
    \begin{subfigure}{0.35\textwidth}
            \centering
          \includegraphics[width=\linewidth, trim={0.9cm, 0.5cm, 1.1cm, 0cm}]{figures/appendix/g_robustness_bias/fmnist_cnn_sigma0.9.pdf}
         \caption{FMNIST, CNN}
    \end{subfigure}
    \caption{Robustness bias among classes for (a) the CIFAR10 ResNet18 model and (b) the FMNIST CNN model. \probust{} shows that model robustness varies significantly across classes, revealing a marked class-wise bias within standard models. The analytical estimator \pmmse{} accurately captures this model bias.}
    \label{fig5:robustness-bias}
\end{figure}

\textbf{Detecting robustness bias among classes: Is the model differently robust for different classes?} We also demonstrate that \probust{} can detect bias in local robustness \cite{nanda2021fairness} by examining its distribution for each class for each model and test set over different $\sigma$'s. Results for the CIFAR10 ResNet18 model are plotted in Figure~\ref{fig5:robustness-bias}. The results show that different classes have significantly different \probust{} distributions, i.e., the model is significantly more robust for some classes (e.g., frog) than for others (e.g., airplane). Similarly, for the FMNIST CNN model in Figure~\ref{fig5:robustness-bias}, we find that the pullover class is much less robust than the sandal class. This observation indicates a disparity in outcomes for these different classes, and underscores the importance of evaluating per-class and per-datapoint robustness metrics before deploying models in the wild.
The results also show that \pmc{} and \pmmse{} have very similar distributions, further indicating that the latter approximates the former well. \probust{} also detects robustness bias for the other models and datasets, namely MNIST CNN and CIFAR100 ResNet18 (Appendix~\ref{app:experiments}). Thus, \probust{} can be applied to detect robustness bias among classes, which is critical when models are deployed in high-stakes, real-world settings.


