\section{Average-Case Robustness Estimation}
\label{sec:methods}

\newcommand{\E}{\mathop{\mathbb{E}}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\X}{\mathbf{x}}
\newcommand{\W}{\mathbf{w}}
\newcommand{\U}{\mathbf{u}}
\newcommand{\matU}{\mathbf{U}}

\newcommand{\grad}{\nabla_{\X}}
\newcommand{\cdf}{\Phi_{\matU \matU^\top}}

\newtheorem{defn}{Definition}
\newtheorem{thm}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{remark}{Remark}

\AtAppendix{\counterwithin{lemma}{section}}
\AtAppendix{\counterwithin{thm}{section}}

\newenvironment{hproof}{%
  \renewcommand{\proofname}{Proof Idea}\proof}{\endproof}

In this section, we first describe the mathematical problem of average-case robustness estimation. Then, we present the naïve estimator based on Monte Carlo sampling and derive more efficient analytical estimators. 

\subsection{Notation and Preliminaries}

Assume that we have a neural network $f: \R^d \rightarrow \R^C$ with $C$ output classes and that the classifier predicts class $t \in \{1, \ldots, C\}$ for a given input $\X \in \R^d$, i.e., $t~=~\arg \max_{i=1}^{C} f_i(\X)$, where $f_i$ denotes the logit for the $i^\text{th}$ class. Given this classifier, the average-case robustness estimation problem is to compute the probability of consistent classification (to class $t$) under noise perturbation of the input. 

\begin{defn} We define the \textbf{average-case robustness} of a classifier $f$ to noise $\mathcal{R}$ at a point $\X$ as
\begin{align*}
    p^\mathrm{robust}(\X, t) = P_{\epsilon \sim \mathcal{R}} \left[ \arg\max_i f_i(\X + \epsilon) = t \right]
\end{align*}
\end{defn}

The more robust the model is in the local neighborhood around $\X$, the larger the average-case robustness measure $p^\text{robust}(\X, t)$. In this paper, given that robustness is always measured with respect to the predicted class~$t$ at $\X$, we henceforth suppress the dependence on $t$ in the notation. We also explicitly show the dependence of $p^\text{robust}$ on the noise scale $\sigma$ by denoting it as $p^\text{robust}_{\sigma}$. 

In this work, we shall consider $\mathcal{R}$ to be an isotropic Normal distribution, i.e., $\mathcal{R} = \mathcal{N}(0, \sigma^2)$. However, as we shall discuss in the next section, it is possible to accommodate both non-isotropic and non-Gaussian distributions in our method. Note that for high-dimensional data ($d \rightarrow \infty$), the isotropic Gaussian distribution converges to the uniform distribution on the surface of the sphere with radius $r = \sigma \sqrt{d}$\footnote{Alternatively, if $\epsilon \sim \mathcal{N}(0, \sigma^2 / d)$, then $r = \sigma$.} due to the concentration of measure phenomenon \cite{vershynin2018high}. 

Observe that when the domain of the input noise is restricted to an $\ell_p$ ball, $p^\text{robust}_{\sigma}$ generalizes the corresponding $\ell_p$ adversarial robustness. In other words, adversarial robustness is concerned with the quantity $\mathbf{1}(p^\text{robust}_{\sigma} < 1)$, i.e., the indicator function that average-case robustness is less than one (which indicates the presence of an adversarial perturbation), while this work focuses on computing the quantity \probust{} itself. In the rest of this section, we derive estimators for \probust{}.

\paragraph{The Monte-Carlo estimator.}
A naïve estimator of average-case robustness is the Monte-Carlo estimator \pmc{}. It computes the robustness of a classifier $f$ at input $\X$ by generating $M$ noisy samples of $\X$ and then calculating the fraction of these noisy samples that are classified to the same class as $\X$. In other words,

\begin{align*}
    p_{\sigma}^\text{robust}(\X) &=P_{\epsilon \sim \mathcal{N}(0,\sigma^2)} \left[ \arg\max_i f_i(\X+ \epsilon) = t \right] \\
    &= \E_{\epsilon \sim \mathcal{N}(0,\sigma^2)} \left[ \mathbf{1}_{\arg\max_i f_i(\X+ \epsilon) = t} \right] \\
    &\approx \frac{1}{M} \sum_{j=1}^{M} \left[ \mathbf{1}_{\arg\max_i f_i(\X+ \epsilon_j) = t} \right]
    = p_{\sigma}^\text{mc}(\X)
\end{align*}

\pmc{} replaces the expectation with the sample average over the $M$ noisy samples of $\X$ and has been used in prior work \citep{nanda2021fairness}. Technically, the error of the Monte-Carlo estimator is independent of dimensionality and scales as $\mathcal{O}(1/ \sqrt{M})$ \citep{vershynin2018high}. However, in practice, for neural networks, \pmc{} requires a large number of random samples to converge to the underlying expectation. For example, for MNIST and CIFAR10 CNNs, it takes around $M = 10{,}000$ samples per point for \pmc{} to converge, which is computationally expensive. Further, \pmc{} provides little information regarding the decision boundaries of the underlying model. Thus, we set out to address this problem by developing more efficient and informative analytical estimators of average-case robustness.
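
As an illustration of the Monte-Carlo estimator described above, the following is a minimal sketch in plain Python. The function name \texttt{p\_mc}, the callable \texttt{f} (assumed to return a list of logits), and the two-class toy classifier are hypothetical and chosen only for illustration; they are not part of our implementation.

```python
import random


def p_mc(f, x, sigma, num_samples, seed=0):
    """Fraction of Gaussian-perturbed copies of x that keep the clean prediction."""
    rng = random.Random(seed)
    logits = f(x)
    # Predicted class t at the clean input x.
    t = max(range(len(logits)), key=logits.__getitem__)
    hits = 0
    for _ in range(num_samples):
        # Add i.i.d. N(0, sigma^2) noise to every coordinate.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        noisy_logits = f(noisy)
        pred = max(range(len(noisy_logits)), key=noisy_logits.__getitem__)
        hits += int(pred == t)
    return hits / num_samples


# Hypothetical toy classifier with two classes and logits (x0, -x0).
def toy_logits(x):
    return [x[0], -x[0]]


estimate = p_mc(toy_logits, [1.0], sigma=1.0, num_samples=20000)
```

For this toy model the prediction is preserved exactly when $\epsilon > -1$, so the estimate should be close to the standard normal CDF evaluated at $1$.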

\subsection{Robustness Estimation via Linearization}

Before deriving analytical robustness estimators for non-linear models, we first consider the simpler problem of deriving this quantity for linear models. Even this is challenging for multi-class classifiers. For example, given a linear model for a three-class classification problem with weights $w_1, w_2, w_3$ and biases $b_1, b_2, b_3$, such that $y = \arg \max_i \{w_i^\top\X + b_i \mid i \in \{1,2,3\} \}$, the decision boundary function between classes $i$ and $j$ is given by $y_{ij} = (w_i - w_j)^\top \X + (b_i - b_j)$. If the predicted label at $\X$ is $y = 1$, the relevant decision boundary functions are $y_{12}$ and $y_{13}$, which characterize misclassifications from class $1$ to classes $2$ and $3$, respectively. To compute the total probability of misclassification, we must compute the probabilities of crossing $y_{12}$ and $y_{13}$ and combine them. Crucially, we must not ``double count'' the region in which both $y_{12}$ and $y_{13}$ are crossed simultaneously. Computing the probability of falling into this region is non-trivial, as it depends on the relative orientations of $y_{12}$ and $y_{13}$. If the two boundaries are orthogonal, the problem simplifies, as the events of crossing $y_{12}$ and crossing $y_{13}$ are independent under isotropic Gaussian noise. However, this is not true in general for non-orthogonal decision boundaries.
Further, this ``double counting'' problem increases in complexity with an increasing number of classes, stemming from a corresponding increase in the number of such pairwise decision boundaries. Lemma \ref{estimator-linear-models} provides an elegant solution to this combinatorial problem via the multivariate Gaussian CDF.

\textbf{Notation}: For clarity, we represent tensors by collapsing along the ``class'' dimension, i.e., $a_i ~ \big\vert_{i=1}^C := (a_1, a_2, \ldots, a_i, \ldots, a_C)$, where for an order-$k$ tensor $a_i$, the expansion $a_i~ \big\vert_{i=1}^C$ is an order-$(k+1)$ tensor. 

\newcommand{\tensor}{~\bigg\vert_{\substack{i = 1\\i \neq t}}^{C}}

\begin{lemma}
The local robustness of a multi-class linear model $f(\X) = \mathbf{w}^\top \X + b$ (with $\mathbf{w} \in \R^{d \times C}$ and $b \in \R^C$) at point $\X$ with respect to a target class $t$ is given by the following. Define weights $\U_i = \W_t - \W_i \in \R^d, \forall i \neq t$, where $\W_t, \W_i$ are columns of $\mathbf{w}$, and biases $c_i = {\U_i}^\top\X + (b_t - b_i) \in \R$. Then, 
\begin{align*}
    p^\mathrm{robust}_\sigma(\X) = \cdf \left( \frac{c_i}{\sigma \| \U_i \|_2} \tensor \right)\\
    \mathrm{where}~~\matU = \frac{\U_i}{\| \U_i \|_2} \tensor \in \R^{(C-1) \times d}
\end{align*}
and $\cdf$ is the ($C-1$)-dimensional Normal CDF with zero mean and covariance $\matU \matU^\top$.
\label{estimator-linear-models}
\end{lemma}

\begin{hproof}
    The proof involves constructing decision boundary functions $g_i(\X) = f_t(\X) - f_i(\X)$ and computing the probability $p^\text{robust}_{\sigma}(\X) = P_{\epsilon}\left(\bigcap_{\substack{i=1\\i \neq t}}^{C} \{ g_i(\X + \epsilon) > 0 \}\right)$, i.e., the probability that no decision boundary is crossed. For Gaussian $\epsilon$, we observe that $\frac{\U_i^\top \epsilon}{\sigma \| \U_i \|_2} \sim \mathcal{N}(0, 1)$ is also Gaussian, which, applied jointly over all $i \neq t$, results in our usage of $\Phi$. By convention, we represent $\matU$ in a normalized form so that its rows have unit norm.
\end{hproof}

The proof is in Appendix~\ref{app:proofs}. Thus, the multivariate Gaussian CDF provides an elegant solution to the previously mentioned ``double counting'' problem. Here, the matrix $\matU$ exactly captures the linear decision boundaries, and the covariance matrix $\matU \matU^\top$ encodes the alignment between pairs of decision boundaries of different classes. 

\textbf{Remark.} For the binary classification case, we get $\matU \matU^\top = 1$ (a scalar), and $p^\text{robust}_{\sigma}(\X) = \phi\left(\frac{c}{\sigma \| \U \|_2} \right)$, where $\phi$ is the CDF of the scalar standard normal, as previously shown by \citet{weng2019proven, pawelczyk2022probabilistically}. Hence, Lemma \ref{estimator-linear-models} is a multi-class generalization of these works.

If the decision boundary vectors $\U_i$ are all orthogonal to each other, then the covariance matrix $\matU \matU^\top$ is the identity matrix. For diagonal covariance matrices, the multivariate Normal CDF (\emph{mvn-cdf}) can be written as the product of univariate Normal CDFs, which is easy to compute. However, in practice, we find that the covariance matrix is strongly non-diagonal, indicating that the decision boundaries are not orthogonal to each other. As a result, the \emph{mvn-cdf} does not have a closed-form solution and needs to be approximated via sampling \cite{botev2017normal, SciPy}. However, this sampling is performed in the $(C-1)$-dimensional space as opposed to the $d$-dimensional space that \pmc{} samples from. In practice, for classification problems, we often have $C \ll d$, making sampling in $(C-1)$ dimensions more efficient. We would like to stress here that the expression in Lemma \ref{estimator-linear-models} represents the simplest expression to compute the average-case robustness: the usage of the multivariate Gaussian CDF cannot be avoided due to the computational nature of this problem.
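
The orthogonal special case above can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the rows of $\matU$ are mutually orthogonal, so that the \emph{mvn-cdf} factorizes; the general case requires a numerical \emph{mvn-cdf} routine instead. The function names and the three-class toy model are hypothetical.

```python
import math


def std_normal_cdf(z):
    """CDF of the scalar standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def p_robust_linear_orthogonal(weights, biases, x, t, sigma, tol=1e-9):
    """Linear-model robustness when U U^T = I, so the mvn-cdf factorizes.

    weights[i] is the weight vector of class i (a list of d floats).
    """
    rows, args = [], []
    for i in range(len(weights)):
        if i == t:
            continue
        u = [wt - wi for wt, wi in zip(weights[t], weights[i])]
        norm = math.sqrt(_dot(u, u))
        c = _dot(u, x) + biases[t] - biases[i]
        rows.append([uj / norm for uj in u])
        args.append(c / (sigma * norm))
    # The factorization is only valid if the off-diagonal entries of U U^T vanish.
    for a in range(len(rows)):
        for b in range(a + 1, len(rows)):
            if abs(_dot(rows[a], rows[b])) > tol:
                raise ValueError("decision boundaries are not orthogonal")
    return math.prod(std_normal_cdf(z) for z in args)


# Hypothetical 3-class model in 2D whose boundaries for class 0 are orthogonal.
toy_weights = [[0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
toy_biases = [0.0, 0.0, 0.0]
p_orth = p_robust_linear_orthogonal(toy_weights, toy_biases, [1.0, 2.0], t=0, sigma=1.0)
```

In this toy example the two normalized margins are $1$ and $2$, so the result is the product of the scalar normal CDF evaluated at these two values.
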
We now discuss the applicability of Lemma \ref{estimator-linear-models} to non-Gaussian noise. 

\begin{lemma} \label{lemma:universality}(\textbf{Application to non-Gaussian noise})
    For high-dimensional data ($d \rightarrow \infty$), Lemma \ref{estimator-linear-models} generalizes to any coordinate-wise independent noise distribution that satisfies Lyapunov's condition. 
\end{lemma} 

\begin{hproof}
    Applying Lyapunov's central limit theorem \cite{patrick1995probability}, given that $\epsilon \sim \mathcal{R}$ is sampled from some distribution $\mathcal{R}$, we have $\frac{\U^\top \epsilon}{\sigma \| \U \|_2} = \sum_{j=1}^{d} \frac{\U_j}{\sigma\| \U \|_2} \epsilon_j \xrightarrow{d} \mathcal{N}(0, 1)$, which holds as long as the terms $\{\frac{\U_j}{\sigma \| \U \|_2} \epsilon_j\}$ are independent random variables that satisfy the Lyapunov condition, which encodes the fact that higher-order moments of such distributions progressively shrink. %For the proof of Lemma \ref{estimator-linear-models}, the fact that this output distribution is Gaussian results in the usage of the multivariate Gaussian CDF $\Phi$ over the outputs.
\end{hproof}

Thus, as long as the input noise distribution is ``well-behaved'', the central limit theorem ensures that the distribution of high-dimensional dot products is approximately Gaussian, motivating our use of the \emph{mvn-cdf} beyond Gaussian input perturbations. We note that it is also possible to generalize Lemma \ref{estimator-linear-models} to \textbf{non-isotropic} Gaussian perturbations with a covariance matrix $\mathcal{C}$, which only changes the covariance matrix of the \emph{mvn-cdf} from $\matU\matU^\top$ to $\matU \mathcal{C} \matU^\top$; we elaborate on this in Appendix \ref{app:proofs}. In the rest of this paper, we focus on the isotropic case. 
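
The central limit argument above can be checked numerically with a small simulation. The sketch below is an illustration only: it assumes coordinate-wise independent \emph{uniform} noise with variance $\sigma^2$, a random direction $\U$, and hypothetical values for the dimension and sample count. It compares the empirical CDF of the normalized projection against the standard normal CDF at a single point.

```python
import math
import random


def std_normal_cdf(z):
    """CDF of the scalar standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def projected_noise_cdf_at(z0, d=200, num_samples=4000, sigma=0.5, seed=0):
    """Empirical P[u^T eps / (sigma ||u||) <= z0] for uniform coordinate-wise noise."""
    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(uj * uj for uj in u))
    # Uniform[-a, a] has variance a^2 / 3, so a = sigma * sqrt(3) gives variance sigma^2.
    half_width = sigma * math.sqrt(3.0)
    count = 0
    for _ in range(num_samples):
        proj = sum(uj * rng.uniform(-half_width, half_width) for uj in u)
        count += int(proj / (sigma * norm) <= z0)
    return count / num_samples


empirical = projected_noise_cdf_at(1.0)
gaussian = std_normal_cdf(1.0)
```

If the approximation holds, \texttt{empirical} and \texttt{gaussian} should agree up to sampling error, even though no coordinate of the noise is Gaussian.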

\subsubsection{Estimator 1: The Taylor Estimator} %\boldmath \ptaylor{}}

Using the estimator derived for multi-class linear models in Lemma \ref{estimator-linear-models}, we now derive the Taylor estimator, a local robustness estimator for non-linear models.


\begin{defn}
    The \textbf{Taylor estimator} for the local robustness of a classifier $f$ at point $\X$ with respect to target class $t$ is given by linearizing $f$ around $\X$ using a first-order Taylor expansion, with decision boundaries $g_i(\X) = f_t(\X) - f_i(\X)$, $\forall i \neq t$, leading to
    \begin{align*}
        p^\mathrm{taylor}_{\sigma}(\X) = \cdf \left( \frac{g_i(\X)}{\sigma \|\grad g_i(\X)\|_2} \tensor \right) 
    \end{align*}
    with $\matU$ and $\Phi$ defined as in the linear case.
\label{eqn:taylor-estimator}
\end{defn}

The proof is in Appendix~\ref{app:proofs}. It involves locally linearizing non-linear decision boundary functions $g_i(\X)$ using a Taylor series expansion. We expect this estimator to have a small error when the underlying model is well-approximated by a locally linear function in the local neighborhood. We formalize this intuition by computing the estimation error for a quadratic classifier. 
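
To make the construction concrete, the following sketch evaluates the Taylor estimator in the binary case ($C = 2$), where the \emph{mvn-cdf} reduces to the scalar normal CDF. It is a simplified illustration rather than our implementation: the gradient is approximated by central finite differences instead of automatic differentiation, and the margin function \texttt{toy\_margin} is hypothetical.

```python
import math


def std_normal_cdf(z):
    """CDF of the scalar standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def numerical_gradient(g, x, h=1e-5):
    """Central finite-difference approximation of the gradient of g at x."""
    grad = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += h
        down[j] -= h
        grad.append((g(up) - g(down)) / (2.0 * h))
    return grad


def p_taylor_binary(g, x, sigma):
    """Taylor estimator for a single margin g(x) = f_t(x) - f_i(x)."""
    grad = numerical_gradient(g, x)
    grad_norm = math.sqrt(sum(v * v for v in grad))
    return std_normal_cdf(g(x) / (sigma * grad_norm))


# Hypothetical mildly non-linear margin; at the origin its gradient is (2, 1).
def toy_margin(x):
    return 2.0 * x[0] + x[1] + 1.0 + 0.1 * x[0] ** 2


p_taylor = p_taylor_binary(toy_margin, [0.0, 0.0], sigma=0.5)
```

At the origin the margin equals $1$ and the gradient norm equals $\sqrt{5}$, so the estimate is the normal CDF evaluated at $1 / (0.5 \sqrt{5})$.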

\begin{thm} The \textbf{estimation error} of the Taylor estimator for a classifier with a quadratic decision boundary $g_i(\X) = \X^\top A_i \X + \U_i^\top \X + c_i$ and positive-semidefinite $A_i$ is upper bounded by
    \begin{align*}
        | p^\mathrm{robust}_{\sigma}(\X) - p^\mathrm{taylor}_{\sigma}(\X) | \leq k \sigma^{C-1} \prod_{\substack{i=1\\i\neq t}}^{C} \frac{\lambda_{\max}^{A_i}}{\| \U_i \|_2} 
    \end{align*}
    for noise $\epsilon \sim \mathcal{N}(0, \sigma^2 / d)$, in the limit of $d \rightarrow \infty$. Here, $\lambda_{\max}^{A_i}$ is the maximum eigenvalue of $A_i$, and $k$ is a small problem-dependent constant.
\end{thm} 

The proof is in Appendix~\ref{app:proofs}. This statement formalizes two key intuitions with regard to the Taylor estimator: (1) the estimation error depends on the size of the local neighborhood $\sigma$ (the smaller the local neighborhood, the more locally linear the model, and the smaller the estimation error), and (2) the estimation error depends on the extent of non-linearity of the underlying function, which is given by the ratio of the maximum eigenvalue of $A_i$ to the $\ell_2$ norm of the linear term $\U_i$. This measure of non-linearity of a function, called normalized curvature, has also been independently proposed by previous work \cite{srinivas2022efficient}. Notably, if the maximum eigenvalue is zero, then the function $g_i(\X)$ is exactly linear, and the estimation error is zero, reverting back to the linear case in Lemma \ref{estimator-linear-models}.

\subsubsection{Estimator 2: The MMSE Estimator} %\boldmath \pmmse{}}

While the Taylor estimator is more efficient than the Monte Carlo estimator, it has a drawback: its linearization is only faithful at perturbations close to the data point and not necessarily for larger perturbations. To mitigate this issue, we use a form of linearization that is faithful over larger noise perturbations. Linearization has been studied in feature attribution research, which concerns itself with approximating non-linear models with linear ones to produce model explanations \cite{han2022explanation}. In particular, the SmoothGrad \cite{smilkov2017smoothgrad} technique has been described as the MMSE (minimum mean-squared error) optimal linearization of the model \cite{han2022explanation, agarwal2021towards} in a Gaussian neighborhood around the data point. Using a similar idea, we propose the MMSE estimator \pmmse{} as follows.

\begin{defn}
    The \textbf{MMSE estimator} for the local robustness of a classifier $f$ at point $\X$ with respect to target class $t$ is given by an MMSE linearization of $f$ around $\X$, for decision boundaries $g_i(\X) = f_t(\X) - f_i(\X)$, $\forall i \neq t$, leading to
    \begin{align*}
        &p^\mathrm{mmse}_{\sigma}(\X) = \cdf \left( \frac{ \Tilde{g}_i(\X)}{\sigma \| \grad \Tilde{g}_i(\X)\|_2} \tensor \right) \\
        &\mathrm{where}~~\Tilde{g}_i(\X) = \frac{1}{N}\sum_{j=1}^{N} g_i(\X + \epsilon_j) ~,~ \epsilon_j \sim \mathcal{N}(0, \sigma^2)
    \end{align*}
    with $\matU$ and $\Phi$ defined as in the linear case, and where $N$ is the number of noise samples. 
\end{defn}

The proof is in Appendix~\ref{app:proofs}. It involves creating a randomized smooth model \cite{cohen2019certified} from the base model and computing the decision boundaries of this smooth model. Note that this estimator also involves drawing noise samples like the Monte Carlo estimator. However, unlike the Monte Carlo estimator, we find that the MMSE estimator converges fast (around $N = 5$), leading to an empirical advantage. We now compute the estimation error of the MMSE estimator.
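
The following sketch illustrates the MMSE estimator in the binary case ($C = 2$), where the \emph{mvn-cdf} reduces to the scalar normal CDF. It is an illustration under simplifying assumptions, not our implementation: the smoothed margin $\Tilde{g}$ is differentiated by central finite differences with the same noise draws reused for every evaluation, and the linear margin \texttt{toy\_margin} is hypothetical. A large $N$ is used in the toy call only to make the comparison with the exact linear answer tight.

```python
import math
import random


def std_normal_cdf(z):
    """CDF of the scalar standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_mmse_binary(g, x, sigma, num_samples, seed=0, h=1e-5):
    """MMSE estimator for a single margin g, using num_samples noise draws."""
    rng = random.Random(seed)
    d = len(x)
    # Fixed noise draws, reused for every evaluation of the smoothed margin.
    noise = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(num_samples)]

    def g_smooth(point):
        total = 0.0
        for eps in noise:
            total += g([p + e for p, e in zip(point, eps)])
        return total / num_samples

    # Central finite differences on the smoothed margin.
    grad = []
    for j in range(d):
        up, down = list(x), list(x)
        up[j] += h
        down[j] -= h
        grad.append((g_smooth(up) - g_smooth(down)) / (2.0 * h))
    grad_norm = math.sqrt(sum(v * v for v in grad))
    return std_normal_cdf(g_smooth(x) / (sigma * grad_norm))


# Hypothetical linear margin, for which the exact robustness is known.
def toy_margin(x):
    return 2.0 * x[0] + x[1] + 1.0


p_mmse = p_mmse_binary(toy_margin, [0.0, 0.0], sigma=0.5, num_samples=2000)
```

For this linear margin the smoothed gradient coincides with the true gradient, and the estimate should be close to the normal CDF evaluated at $1 / (0.5 \sqrt{5})$.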

\newcommand{\mn}{\text{mean}}

\begin{thm} The \textbf{estimation error} of the MMSE estimator for a classifier with a quadratic decision boundary $g_i(\X) = \X^\top A_i \X + \U_i^\top \X + c_i$ and positive-semidefinite $A_i$ is upper bounded by
    \begin{align*}
        | p^\mathrm{robust}_{\sigma}(\X) - p^\mathrm{mmse}_{\sigma}(\X) | \leq k \sigma^{C-1} \prod_{\substack{i=1\\i\neq t}}^C \frac{\lambda_{\max}^{A_i} - \lambda_{\mn}^{A_i}}{\| \U_i \|_2}  
    \end{align*}
    for noise $\epsilon \sim \mathcal{N}(0, \sigma^2 / d)$, in the limit of $d \rightarrow \infty$ and $N \rightarrow \infty$. Here, $\lambda_{\max}^{A_i}, \lambda_{\mn}^{A_i}$ are the maximum and mean eigenvalues of $A_i$, respectively, and $k$ is a small problem-dependent constant. 
\end{thm}

The proof is in Appendix~\ref{app:proofs}. The result above highlights two aspects of the MMSE estimator: (1) its error bound is smaller than that of the Taylor estimator, since $\lambda_{\mn}^{A_i} \geq 0$ for positive-semidefinite $A_i$, and (2) even in the limit of a large number of samples $N \rightarrow \infty$, the error of the MMSE estimator is non-zero, except when $\lambda_{\mn}^{A_i} = \lambda_{\max}^{A_i}$. For PSD matrices, this equality holds when $A_i$ is a multiple of the identity matrix\footnote{When $d \rightarrow \infty$, $\epsilon^\top A \epsilon = \lambda \| \epsilon \|^2 = \lambda \sigma^2$ is a constant, and thus an isotropic quadratic function resembles a linear one in this neighborhood.}, in which case the error bound is zero, reverting back to the linear case in Lemma \ref{estimator-linear-models}. 


\subsubsection{(Optionally) Approximating \emph{mvn-cdf}: Connecting Robustness Estimation with Softmax}


\paragraph{Approximation with Multivariate Sigmoid.} One drawback of the Taylor and MMSE estimators is their use of the \emph{mvn-cdf}, which does not have a closed form solution and can cause the estimators to be slow for settings with a large number of classes $C$. In addition, the \emph{mvn-cdf} makes these estimators non-differentiable, which is inconvenient for applications which require differentiating \probust{}. To alleviate these issues, we approximate the \emph{mvn-cdf} with an analytical closed-form expression. As CDFs are monotonically increasing functions, the approximation should also be monotonically increasing.

To this end, it has been previously shown that the \emph{univariate} Normal CDF $\phi$ is well-approximated by the sigmoid function \cite{hendrycks2016gaussian}. It is also known that when $\matU \matU^\top = I$, the \emph{mvn-cdf} is given by $\Phi(\X) = \prod_i\phi(\X_i)$, i.e., it is the product of the univariate normal CDFs. Thus, we may choose to approximate $\Phi(\X) \approx \prod_i \text{sigmoid}(\X_i)$. When the inputs are large, this can be simplified further as follows:

\begin{align*}
    \Phi_{I}(\X) &= \prod_i \phi(\X_i) \approx \prod_i \frac{1}{1 + \exp(-\X_i)}\\
    &= \frac{1}{1 + \sum_i \exp(-\X_i) + \sum_{j < k} \exp(-\X_j - \X_k) + \cdots} \\
    &\approx \frac{1}{1 + \sum_i \exp(-\X_i)} ~~~(\text{for} ~~\X_i \rightarrow \infty~~ \forall i)
\end{align*}

%\begin{defn}
%    The multivariate sigmoid is defined as $\text{mv-sigmoid}(\X) = \frac{1}{1 + \sum_{i} \exp(-\X_i)}$ 
%\end{defn}

We call the final expression the ``multivariate sigmoid'' (\emph{mv-sigmoid}), which serves as our approximation of the \emph{mvn-cdf}, especially at the tails of the distribution. While we expect estimators using \emph{mv-sigmoid} to approximate ones using \emph{mvn-cdf} only when $\matU \matU^\top = \mathbf{I}$, we find experimentally that the approximation works well even for practical values of the covariance matrix $\matU\matU^\top$. Using this approximation to substitute \emph{mv-sigmoid} for \emph{mvn-cdf} in the \ptaylor{} and \pmmse{} estimators yields the \ptaylormvs{} and \pmmsemvs{} estimators, respectively. We present further analysis on the multivariate sigmoid in Appendix \ref{app:experiments}.
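
The quality of the last simplification step can be checked numerically. The sketch below compares the product of sigmoids with the \emph{mv-sigmoid} for large and for small inputs; the function names and the input values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid_product(z):
    """Product of univariate sigmoids, one per coordinate of z."""
    return math.prod(1.0 / (1.0 + math.exp(-zi)) for zi in z)


def mv_sigmoid(z):
    """Multivariate sigmoid: drops all cross terms in the expanded denominator."""
    return 1.0 / (1.0 + sum(math.exp(-zi) for zi in z))


large_inputs = [6.0, 7.0, 8.0]
small_inputs = [0.0, 0.0, 0.0]

# Absolute gap between the two expressions in each regime.
gap_large = abs(sigmoid_product(large_inputs) - mv_sigmoid(large_inputs))
gap_small = abs(sigmoid_product(small_inputs) - mv_sigmoid(small_inputs))
```

The gap is negligible when all inputs are large, because the dropped cross terms decay exponentially, and it is substantial near zero, which is consistent with the approximation being most accurate at the tails.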



\paragraph{Approximation with Softmax.} A common method to estimate the confidence of model predictions is to use the softmax function applied to the logits $f_i(\X)$ of a model. We note that softmax is identical to \emph{mv-sigmoid} when directly applied to the logits of neural networks: 

\begin{align*}
    \text{softmax}_t\left( f_i(\X) ~\Big\vert_{\substack{i = 1}}^{C} \right) &= \frac{\exp(f_t(\X))}{\sum_{i=1}^C \exp(f_i(\X))} \\
    &= \frac{1}{1 + \sum\limits_{\substack{i=1\\i \neq t}}^{C} \exp(f_i(\X) - f_t(\X))} = \text{mv-sigmoid}\left( g_i(\X) ~\Big\vert_{\substack{i = 1\\i\neq t}}^{C} \right)
\end{align*}

Recall that $g_i(\X) = f_t(\X) - f_i(\X)$ is the decision boundary function. Note that this equivalence only holds for the specific case of logits, and cannot be applied to approximate the Taylor estimator, for instance. Nonetheless, given this similarity, it is reasonable to ask whether softmax applied to logits (henceforth $p^\text{softmax}_{T}$ for softmax with temperature $T$) itself can be a ``good enough'' estimator of $p^\text{robust}_{\sigma}$ in practice. In other words, does $p^\text{softmax}_T$ well-approximate $p^\text{robust}_{\sigma}$ in certain settings?
In Appendix \ref{app:proofs}, we provide a theoretical result for a restricted linear setting where softmax can indeed match the behavior of \ptaylormvs{}, which happens precisely when $\matU \matU^\top = \mathbf{I}$ and all the class-wise gradients are equal. In the next section, we demonstrate empirically that the softmax estimator $p^{\text{softmax}}_T$ is a poor estimator of average-case robustness in practice.
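
The identity between softmax on the logits and \emph{mv-sigmoid} on the margins $g_i(\X) = f_t(\X) - f_i(\X)$ can be verified numerically. The sketch below does so for a hypothetical three-class logit vector; the function names are illustrative.

```python
import math


def softmax_at(logits, t):
    """Softmax probability of class t, computed with the usual max shift."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return exps[t] / sum(exps)


def mv_sigmoid(z):
    """Multivariate sigmoid 1 / (1 + sum_i exp(-z_i))."""
    return 1.0 / (1.0 + sum(math.exp(-zi) for zi in z))


# Hypothetical logits; class 0 is the predicted class t.
logits = [2.0, -1.0, 0.5]
t = 0
margins = [logits[t] - logits[i] for i in range(len(logits)) if i != t]

softmax_value = softmax_at(logits, t)
mv_sigmoid_value = mv_sigmoid(margins)
```

The two values agree up to floating-point error, which matches the algebraic identity above. Note that this says nothing about whether either quantity is close to $p^\text{robust}_{\sigma}$.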



