\section{Experiments}



We conduct two sets of experiments to validate our key contributions. (\textbf{i}) We show that we can boost certified accuracy for several pre-trained models by using Algorithm \ref{alg:RS_DS} for data dependent smoothing only during certification, \ie without employing any additional training. (\textbf{ii}) Once data dependent smoothing is employed during training, we can improve the certified accuracy even further.
% (Algorithms \ref{alg:RS_DS} and \ref{alg:DS_DS}).
Since our framework is agnostic to the training routine, we incorporate it
% to the best of our knowledge, 
into 
% the only three methods that employ randomized smoothing as part of training, namely 
(\textbf{i}) \textsc{Cohen} \citep{cohen2019certified}, (\textbf{ii}) \textsc{SmoothAdv} \citep{salman2019provably} and (\textbf{iii}) \textsc{MACER} \citep{zhai2020macer}. Throughout, we use $\text{DS}$ to denote data dependent smoothing used only in certification and $\text{DS}^2$ to denote its use during both training and certification.

% \BG{consider using \textsc{\textsc{Cohen}}, \textsc{\textsc{SmoothAdv}}, and \textsc{\textsc{MACER}} instead; they look nicer this way}



\textbf{Setup.} 
We conduct experiments with ResNet-18 and ResNet-50 \citep{resnet} on CIFAR10 \citep{cifars} and ImageNet \citep{imagenet}, respectively. For CIFAR10 experiments, we train from scratch for 200 epochs. For ImageNet, we initialize using the network parameters provided by the authors. When $\sigma$ is fixed, we follow prior art, \eg \textsc{Cohen}, \textsc{SmoothAdv}, and \textsc{MACER}, and set $\sigma \in \{0.12, 0.25, 0.50\}$ and $\sigma \in \{0.25, 0.50, 1.0\}$ for CIFAR10 and ImageNet, respectively, for both training and certification. We set $\alpha=10^{-4}$ in Algorithm \ref{alg:RS_DS} and the initial $\sigma_0$ to the $\sigma$ used in training the respective model. Unless stated otherwise, we set $n=1$ in Algorithm \ref{alg:RS_DS}. Following \textsc{Cohen} and \textsc{SmoothAdv}, we compare models using the approximate certified accuracy curve (simply referred to as certified accuracy) followed by the envelope curve over all $\sigma$. We also report the Average Certified Radius (ACR) proposed by \textsc{MACER}, defined as $\nicefrac{1}{|\mathcal S_{test}|} \sum_{(x,y)\in\mathcal S_{test}}  R(f_\theta, x) \cdot \mathbbm{1}\{\argmax_c g^c_\theta(x) = y\}$, where $\mathbbm{1}\{\cdot\}$ is the indicator function and $R(f_\theta, x)$ is the certified radius at $x$. Following \textsc{Cohen}
% \footnote{We use the available code of \cite{cohen2019certified} to report all certification results in the paper for a given $\sigma$.}
and all randomized smoothing methods, we certify all results using $N_0=100$ Monte Carlo samples for prediction and $N=100,000$ estimation samples to estimate the radius with a failure probability of $0.001$ given a smoothing $\sigma$.
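
As an illustration of the reported quantities (the numbers below are hypothetical and are not taken from our experiments), recall the $\ell_2$ certified radius of Gaussian randomized smoothing \citep{cohen2019certified}, $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$, where $p_A$ and $p_B$ denote the probabilities of the top and runner-up classes under noise. With $\sigma = 0.5$, $p_A = 0.9$ and $p_B = 0.1$, we have
\[
R = \frac{0.5}{2}\left(\Phi^{-1}(0.9) - \Phi^{-1}(0.1)\right) \approx 0.25 \times (1.2816 + 1.2816) \approx 0.641 .
\]
If a toy test set consists of three inputs, two correctly classified with radii $0.641$ and $0.2$ and one misclassified (which contributes zero), then $\text{ACR} = \nicefrac{(0.641 + 0.2 + 0)}{3} \approx 0.280$.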
\input{ figs/Cohen}
\input{ tables/Cohen}

% where $\mathbbm{1}$ is an indicator function and $R$ is the radius in Equation \eqref{eq:certification_radius}.


\input{ figs/SmoothAdv}
\input{ tables/SmoothAdv}

\subsection{\textsc{Cohen} + DS}
% \vspace{-0.15cm}

We combine data dependent smoothing with \textsc{Cohen}. Following Gaussian augmentation, this method trains $f_\theta$ on $(x+\epsilon)$, where $\epsilon \sim \mathcal N (0,\sigma^2I)$, with the cross entropy loss.
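
Written out, with $\mathcal{L}_{\text{CE}}$ as shorthand for the cross entropy loss and $\mathcal S_{train}$ for the training set (both are notation introduced here only for illustration), the fixed $\sigma$ training objective can be sketched as
\[
\min_\theta \sum_{(x,y)\in\mathcal S_{train}} \mathbb{E}_{\epsilon \sim \mathcal N(0,\sigma^2 I)} \left[ \mathcal{L}_{\text{CE}}\left(f_\theta(x+\epsilon),\, y\right) \right],
\]
where the expectation is typically approximated by sampling noise at each training step.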

% \vspace{-0.10cm}
\textbf{DS for certification only.} We first certify the trained models with the same fixed $\sigma$ used in training for all inputs, dubbed \textsc{Cohen}. Then, we certify the same trained models with the proposed data dependent $\sigma_x^*$ produced by Algorithm \ref{alg:RS_DS}, using the memory based certification; we refer to this as $\text{\textsc{Cohen}-DS}$. Figure \ref{fig:GA} plots the certified accuracy for CIFAR10 and ImageNet in the first and second rows, respectively. Even though the base classifier $f_\theta$ is identical for $\text{\textsc{Cohen}}$ and $\text{\textsc{Cohen}-DS}$, Figure \ref{fig:GA} shows that $\text{\textsc{Cohen}-DS}$ is superior to $\text{\textsc{Cohen}}$ in certified accuracy across almost all radii and for all training $\sigma$ on both datasets. This is also evident from the envelope plots in the last column of Figure \ref{fig:GA}.
% reporting the best certified accuracy per radius over all $\sigma$ trained models.
In Table \ref{tb:Cohen}, we report the best certified accuracy per radius over all training $\sigma$ for $\text{\textsc{Cohen}}$ (envelope figure) against our best $\text{\textsc{Cohen}-DS}$, cross-validated over all training $\sigma$ and the number of iterations $K$ in Algorithm \ref{alg:RS_DS}, accompanied by the corresponding ACR score. For instance, we observe that data dependent certification $\text{\textsc{Cohen}-DS}$ can significantly boost certified accuracy at radii $0.5$ and $0.75$ by $7.7\%$ (from $40.1\%$ to $47.8\%$) and $9.1\%$ (from $29.2\%$ to $38.3\%$), respectively, and by $0.193$ ACR points on CIFAR10. Moreover, we boost the certified accuracy on ImageNet by $4.6\%$ and $3.2\%$ at radii $0.5$ and $0.75$, respectively, and by $0.159$ ACR points.


% \vspace{-0.10cm}
\textbf{DS for training and certification.} We employ data dependent smoothing in both training and certification for \textsc{Cohen} models (denoted as $\text{\textsc{Cohen}-DS}^2$) by running Algorithm \ref{alg:DS_DS}. For CIFAR10, we first train $\text{\textsc{Cohen}}$ with fixed $\sigma$ for 50 epochs, \ie $K=0$ in Algorithm \ref{alg:RS_DS}, and then perform data dependent smoothing with $K=1$ for the remaining 150 epochs. For ImageNet experiments, we only fine-tune the provided models for 30 epochs using Algorithm \ref{alg:DS_DS} with $K=1$. Once training is complete, we certify all trained models with Algorithm \ref{alg:RS_DS} using the memory based certification. In Figure \ref{fig:GA}, we observe that $\text{\textsc{Cohen}-DS}^2$ further improves certified accuracy across all trained models on both CIFAR10 and ImageNet. This is also evident in the last column of Figure \ref{fig:GA}, which shows the best certified accuracy per radius (envelope) over all training $\sigma$. We note that $\text{\textsc{Cohen}-DS}^2$ improves the certified accuracy of $\text{\textsc{Cohen}-DS}$ by $2.6\%$ and $0.9\%$ at radii $0.5$ and $0.75$, respectively, on CIFAR10, and by $4.8\%$ and $1.8\%$ at radii $0.5$ and $0.75$, respectively, on ImageNet. The improvements are consistently present over a wide range of radii on both datasets. We do observe that the ACR score for $\text{\textsc{Cohen}-DS}^2$ on CIFAR10 drops marginally compared to $\text{\textsc{Cohen}-DS}$. We believe this is because some inputs that are classified correctly at small radii have an overall larger certification radius under $\text{\textsc{Cohen}-DS}$ than under $\text{\textsc{Cohen}-DS}^2$ on CIFAR10. Regardless, $\text{\textsc{Cohen}-DS}^2$ substantially outperforms $\text{\textsc{Cohen}}$ by $0.173$ ACR points. Compared to $\text{\textsc{Cohen}-DS}$, $\text{\textsc{Cohen}-DS}^2$ improves the ACR on ImageNet from $1.257$ to $1.319$.
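
One way to see how the ACR and the certified accuracy at individual radii can move in opposite directions is to note that the ACR equals the area under the certified accuracy curve. This is an identity about the metric rather than an additional experimental finding. Writing $\text{CA}(r)$ (shorthand introduced here) for the fraction of test inputs that are correctly classified with certified radius at least $r$, and using $R = \int_0^\infty \mathbbm{1}\{R \geq r\}\, dr$ for $R \geq 0$, we get
\[
\text{ACR} = \frac{1}{|\mathcal S_{test}|} \sum_{(x,y)\in\mathcal S_{test}} R(f_\theta, x)\cdot \mathbbm{1}\{\argmax_c g^c_\theta(x) = y\} = \int_0^\infty \text{CA}(r)\, dr .
\]
Hence, a gain in $\text{CA}(r)$ at moderate radii such as $0.5$ and $0.75$ can be offset by a loss at other radii, leaving the ACR slightly lower.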


% \vspace{-0.15cm}
\subsection{\textsc{SmoothAdv} + DS}
% \vspace{-0.15cm}
We combine our data dependent smoothing strategy with the more effective \textsc{SmoothAdv}, which trains the smoothed classifier for every $x$ on the adversarial example $\hat{x}$ that maximizes $-\log \mathbb {E}_{\epsilon \sim \mathcal N (0,\sigma^2I)} \left[f^y_{\theta}(x' + \epsilon)\right]$, where $\|x'-x\| \leq \zeta$.%\BG{consider changing the variable from $\delta$ to something else; you used $\delta$ as the vector perturbation in Section 3.2}
% \begin{align*}
% \hat{x} = \argmax_{\|x'-x\| \leq \delta} ~-\log \mathbb {E}_{\epsilon \sim \mathcal N (0,\sigma^2I)} \left[f_{\theta}(x' + \epsilon)\right].
% \end{align*}
For CIFAR10 experiments, we follow the training procedure of \textsc{SmoothAdv}, where the adversary $\hat{x}$ is computed with 2 PGD (projected gradient descent) steps with $\zeta = 0.25$ and one augmented sample to estimate the expectation. For ImageNet experiments, we use the best reported models, in terms of certified accuracy, provided by the authors, which correspond to $\zeta = 0.5$ for $\sigma = 0.25$ and $\zeta = 1.0$ for $\sigma \in \{0.5, 1.0\}$.
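
For concreteness, the Monte Carlo approximation behind this adversary can be sketched as follows, where $m$ denotes the number of noise samples (a symbol introduced here only for illustration):
\[
\hat{x} \approx \argmax_{\|x'-x\| \leq \zeta} \; -\log \left( \frac{1}{m} \sum_{i=1}^{m} f^y_{\theta}(x' + \epsilon_i) \right), \qquad \epsilon_i \sim \mathcal N (0,\sigma^2I).
\]
The maximization is carried out approximately with PGD; the CIFAR10 setting above corresponds to $m=1$ and 2 PGD steps.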

\input{ figs/MACER}
\input{ tables/MACER}
% \vspace{-0.10cm}
\textbf{DS for certification only.} Similar to \textsc{Cohen}, we first certify \textsc{SmoothAdv} models with the same fixed $\sigma$ used in training. Then, we certify the same models with the proposed data dependent $\sigma_{x}^*$ using the memory-based certification, which we refer to as $\text{\textsc{SmoothAdv}-DS}$. In Figure \ref{fig:SmoothAdv}, we show the certified accuracy for CIFAR10 and ImageNet in the first and second rows, respectively. The last column shows the envelopes per radius. Even though they both share the same classifier $f_\theta$, $\text{\textsc{SmoothAdv}-DS}$ significantly improves upon $\text{\textsc{SmoothAdv}}$ over all radii and all values of $\sigma$ in training for both CIFAR10 and ImageNet. In particular, for models trained with $\sigma=0.25$, $\text{\textsc{SmoothAdv}}$ achieves zero certified accuracy for large certification radii ($\ge 1.0$), while $\text{\textsc{SmoothAdv}-DS}$ achieves non-trivial certified accuracy in these cases. Similar to the earlier setup, we report the best certified accuracy along with the ACR scores in Table \ref{tb:SmoothAdv}. We improve over $\text{\textsc{SmoothAdv}}$ by large margins. For example, the certified accuracy at radius $0.5$ increases by $5.4\%$ and $2.8\%$ on CIFAR10 and ImageNet, respectively. The improvement is consistent over all radii. The ACR also improves by $0.118$ and $0.158$ on CIFAR10 and ImageNet, respectively.



% \vspace{-0.10cm}
\textbf{DS for training and certification.} 
We fine-tune the $\text{\textsc{SmoothAdv}}$ trained models (either the retrained CIFAR10 models or the ImageNet models provided by \textsc{SmoothAdv}) using Algorithm \ref{alg:DS_DS}, where $\sigma_x^*$ is computed using Algorithm \ref{alg:RS_DS}. We report the per $\sigma$ certified accuracy, comparing $\text{\textsc{SmoothAdv}-DS}^2$ (also certified using memory based certification) to both $\text{\textsc{SmoothAdv}-DS}$ and $\text{\textsc{SmoothAdv}}$. $\text{\textsc{SmoothAdv}-DS}^2$ further improves the certified accuracy compared to $\text{\textsc{SmoothAdv}-DS}$, with performance gains more prominent on ImageNet. While the improvement of $\text{\textsc{SmoothAdv}-DS}^2$ over $\text{\textsc{SmoothAdv}-DS}$ is small on CIFAR10, \eg $0.7\%$ at radius $0.5$, the performance gaps are larger on ImageNet, reaching $1.4\%$ at radius $0.5$ as shown in Table \ref{tb:SmoothAdv}. We see a similar trend in ACR with improvements of $0.013$ and $0.069$ on CIFAR10 and ImageNet, respectively. Overall, $\text{\textsc{SmoothAdv}-DS}^2$ boosts the certified accuracy of $\text{\textsc{SmoothAdv}}$ at radius $0.5$ by $6.1\%$ and $4.2\%$ on CIFAR10 and ImageNet, respectively.




\input{ figs/qualitative}

% \vspace{-0.15cm}
\subsection{\textsc{MACER} + DS}
% \vspace{-0.15cm}
We integrate data dependent smoothing within \textsc{MACER}, which trains $g_\theta$ by minimizing the following objective over the parameters $\theta$:
% updates model parameters using a regularization that encourages a maximum certification radius, as follows:
% \vspace{-2pt}
$
% \begin{align*}
% \min_\theta 
-\log g_\theta(x) + \frac{\lambda \sigma}{2} \max\left (\gamma - \frac{2R}{\sigma} , 0\right) \cdot \mathbbm{1}\{\argmax_c ~g^c_\theta(x) = y\},
% \end{align*}
$ where $R$ also depends on $\theta$. While this seems to be similar in spirit to our approach, we in fact maximize the certification radius over $\sigma$ with fixed parameters $\theta$ for every $x$. 
% The expectations in the loss are approximated with Monte Carlo sampling.
We conduct experiments on CIFAR10
% \footnote{ResNet-50 trained models on ImageNet are not provided by the authors.
% Training them from scratch is prohibitively expensive.
% } 
following the training procedure of \textsc{MACER}, estimating the expectation with $64$ samples and setting $\lambda = 12$ and $\gamma = 8$. We set $n=8$ in Algorithm \ref{alg:RS_DS}, with ablations on $n=1$ in the \textbf{appendix}.
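
To give a feel for the regularizer under these settings, consider a hypothetical correctly classified input (numbers chosen for illustration only) with $\sigma = 0.25$ and $R = 0.5$. With $\lambda = 12$ and $\gamma = 8$, the hinge term evaluates to
\[
\frac{\lambda \sigma}{2} \max\left(\gamma - \frac{2R}{\sigma}, 0\right) = \frac{12 \times 0.25}{2} \max\left(8 - 4, 0\right) = 1.5 \times 4 = 6 .
\]
The term vanishes once $R \geq \nicefrac{\gamma\sigma}{2}$, \ie $R \geq 1$ in this example, so the hinge only acts on inputs whose radius lies below that threshold.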


% \vspace{-0.10cm}
\textbf{DS for certification only.} Similar to the earlier setup in \textsc{Cohen} and \textsc{SmoothAdv}, we certify models with fixed $\sigma$ and then with data dependent $\sigma^*_x$ using the memory based certification, referred to as $\text{\textsc{MACER}-DS}$. In Figure \ref{fig:MACER}, we observe that $\text{\textsc{MACER}-DS}$ significantly outperforms $\text{\textsc{MACER}}$, particularly in the large radius region. This can also be seen in the envelope figure reporting the best certified accuracy per radius over $\sigma$. Similarly, Table \ref{tb:MACER} demonstrates the benefits of data dependent smoothing, where it boosts certified accuracy by $7.4\%$ (from $59.3\%$ to $66.7\%$) and $8.7\%$ (from $43.6\%$ to $52.3\%$) at radii $0.25$ and $0.5$, respectively. Moreover, we improve ACR by $0.139$ points.

% \vspace{-0.10cm}
\textbf{DS for training and certification.} We incorporate data dependent smoothing as part of \textsc{MACER} training and certification in a similar fashion to the earlier setup, dubbed $\text{\textsc{MACER}-DS}^2$. Figure \ref{fig:MACER} shows the improvement of $\text{\textsc{MACER}-DS}^2$  over the certification only $\text{\textsc{MACER}-DS}$ over all trained models. Table \ref{tb:MACER} summarizes the best certified accuracy per radius. Overall, we find that the performance is comparable or slightly better than $\text{\textsc{MACER}-DS}$, which is still significantly better than $\text{\textsc{MACER}}$ by $8.67\%$ at radius $0.5$. We also observe that  $\text{\textsc{MACER}-DS}$ enjoys better ACR than $\text{\textsc{MACER}-DS}^2$ with both being far better than the  $\text{\textsc{MACER}}$ baseline.
% \input{ figs/iteration_ablation}
% \input{ figs/appendix_visualization_\textsc{Cohen}}
% \input{ figs/appendix_visualization_\textsc{SmoothAdv}}


\begin{figure}
% \vspace{-0.95cm}
    \centering
        \includegraphics[width=0.23\textwidth]{ figures/iterations/CIFAR10_Adv_smooth_DS_DS_sigma_012.png}
        \includegraphics[width=0.23\textwidth]{ figures/iterations/ImageNet_Adv_smooth_RS_DS_sigma_050.png}
    \caption{\textbf{Varying %the number of iterations
    $K$ in Algorithm \ref{alg:DS_DS}.} The left figure shows certification with $\sigma_0=0.12$ on CIFAR10, and the right figure shows certification with $\sigma_0=0.5$ on ImageNet.}
    \label{fig:iteration} 
\end{figure}

\subsection{DS for $\ell_1$ Certificates}

\textcolor{black}{Finally, we extend our methodology to $\ell_1$ certification. We leverage the results of \cite{yang2020randomized}, who derived the tightest $\ell_1$ certificate using randomized smoothing with the uniform distribution $\mathcal U[-\lambda, \lambda]^d$. The certified radius in that case has the form $\mathcal R_1 = \lambda(p_A - p_B)$. We replace our objective in Equation \eqref{eq:our_objective_v2} with:
\begin{equation}
\label{eq:our_objective_2}
\begin{aligned}
\lambda_x^* = &\text{arg}\max_\lambda \lambda \Bigg(\mathbb E_{\epsilon\sim \mathcal{U}[-\lambda, \lambda]^d}(f_\theta^{c_A}(x+\epsilon)) \\
&- \max_{c \neq c_A} \mathbb E_{\epsilon\sim \mathcal{U}[-\lambda, \lambda]^d}(f_\theta^{c}(x+\epsilon))  \Bigg).
\end{aligned}
\end{equation}
% \begin{equation}
% \begin{align*}
% \end{align*}
% \end{equation}
% \begin{equation}
% \label{eq:our_objective}
% \begin{aligned}
%     \sigma^*_x =&\argmax_{\sigma} ~\frac{\sigma}{2} \Bigg(\Phi^{-1}\left( \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\sigma^2I)}[f_\theta^{c_A}(x+\epsilon)]\right)  \\ \,\, - \,\, & \Phi^{-1}\left(\max_{c \neq c_A} \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\sigma^2I)}[f_\theta^c(x+\epsilon)]\right) \Bigg).
% \end{aligned}
% \end{equation}
We solve our objective in Equation~\eqref{eq:our_objective_2} in an identical fashion to Algorithm \ref{alg:RS_DS}, with the same hyperparameters, for $\lambda \in \{0.25, 0.5, 1.0 \}$ in certification on both CIFAR10 and ImageNet. Further, we combine our data-dependent smooth classifier with the memory based algorithm proposed in Section~\ref{sec:memory-algorithm}. It is worth mentioning that, similar to the $\ell_2$ case, the memory based algorithm did not find any overlap between the certified regions of any pair of instances.
We report the results in Table \ref{tab:l1}. We observe that, similar to our extensive experiments on the $\ell_2$ certificate, our proposed memory-enhanced data-dependent smoothing yields consistent improvement in the $\ell_1$ certified accuracy. We report an improvement of 7\% and 3\% over the state-of-the-art certified accuracy at an $\ell_1$ radius of 0.5 on CIFAR10 and ImageNet, respectively. Finally, we note a similar improvement in the $\ell_1$ ACR, as reported in Table \ref{tab:l1}.
}
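
As a hypothetical numerical illustration of the trade-off that Equation \eqref{eq:our_objective_2} navigates (the probabilities below are made up for this example), suppose that at $\lambda = 0.5$ the classifier attains $p_A = 0.9$ and $p_B = 0.1$, while at $\lambda = 1.0$ the noisier inputs give $p_A = 0.6$ and $p_B = 0.3$. Then
\[
\mathcal R_1\big|_{\lambda = 0.5} = 0.5 \times (0.9 - 0.1) = 0.4, \qquad \mathcal R_1\big|_{\lambda = 1.0} = 1.0 \times (0.6 - 0.3) = 0.3,
\]
so the larger $\lambda$ does not yield the larger certificate here. Which $\lambda$ is best depends on how quickly the class probabilities degrade around the given $x$, which is what motivates choosing $\lambda_x^*$ per input.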


% \vspace{-0.15cm}
\subsection{Discussion and Ablation}
% \vspace{-0.15cm}
\textbf{Varying $K$.} We pose the question: does attaining better solutions to our proposed objective in Equation \eqref{eq:our_objective_v2} improve certified accuracy? To answer this, we control the solution quality of $\sigma_x^*$ by certifying trained models with a varying number of stochastic gradient ascent iterations $K$ in Algorithm \ref{alg:RS_DS}. In particular, we certify the trained models $\text{\textsc{SmoothAdv}-DS}^2$ and $\text{\textsc{SmoothAdv}-DS}$ on CIFAR10 and ImageNet, respectively, with a varying $K$. We leave the experiments for the other models to the \textbf{appendix}. We observe in Figure \ref{fig:iteration} that the certified accuracy per radius consistently improves as $K$ increases, particularly in the large radius regime. This is expected, since Algorithm \ref{alg:RS_DS} produces better estimates of the optimal smoothing $\sigma_x^*$ per input $x$ with larger $K$, which in turn improves the certification radius. It also suggests that there is room for further improvement with more powerful optimizers.
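
For reference, one iteration of such a stochastic gradient ascent scheme can be sketched as follows. This is only a sketch, under the assumption that the expectations are estimated with $n$ reparameterized samples $\epsilon_i = \sigma \hat{\epsilon}_i$ with $\hat{\epsilon}_i \sim \mathcal N(0, I)$; Algorithm \ref{alg:RS_DS} remains the authoritative description. Writing $\hat{R}(\sigma)$ for the resulting sample estimate of the certification radius at $x$ (notation introduced here), the update reads
\[
\sigma^{k+1} = \sigma^{k} + \alpha \, \nabla_\sigma \hat{R}(\sigma^{k}), \qquad k = 0, \dots, K-1,
\]
starting from $\sigma^0 = \sigma_0$ and returning $\sigma^{K}$ as the estimate of $\sigma_x^*$. With $K = 0$, the procedure simply returns the fixed $\sigma_0$.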
% \footnote{Greedy heuristics solving Equation \ref{eq:our_objective_v2} are in the \textbf{appendix}, as they perform far worse than our approach.}. 



% \vspace{-0.10cm}
\textbf{Visualizing $\sigma^*_x$.} We show how the $\sigma^*_x$ that maximizes the certification radius varies over different inputs $x$. Figure \ref{fig:qualitative} shows two examples, where the first and fourth columns contain the clean images. In the second column, a fixed choice of $\sigma=0.5$ is too large compared to our estimated $\sigma^*_x=0.368$ that maximizes the certification radius as per Algorithm \ref{alg:RS_DS}. As for the fifth column, we observe that a constant $\sigma=0.25$ is far smaller than $\sigma^*_x=0.423$. This indicates that the $\sigma^*_x$ maximizing the certification radius indeed varies significantly over inputs.

% \vspace{-0.10cm}
% \textbf{Runtime.} We measure the certification runtime on an NVIDIA Quadro RTX-6000 GPU for our proposed data dependent smoothed classifier (time includes Algorithm \ref{alg:RS_DS} in addition to the memory based certification) compared to the certification of a fixed $\sigma$ classifier. Certifying one CIFAR10 test input with ResNet18 takes 1.6 and an average of 1.8 seconds for a fixed $\sigma$ classifier and for the data dependent classifier ($K = 900$), respectively. Certifying an ImageNet test input on ResNet50 takes 109.5 and an average of 136 seconds for a fixed $\sigma$ classifier and our data dependent classifier ($K=400$), respectively. The runtime overhead added by
% using Algorithm \ref{alg:RS_DS} and memory based certification is negligible compared to the gains in certified accuracy.






% \begin{figure*}[t]
%      \centering
%      \begin{subfigure}[b]{0.32\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{ICLR22/new_figures/l1_cifar10-0.25.png}
%      \end{subfigure}
%     %  \hfill
%      \begin{subfigure}[b]{0.32\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{ICLR22/new_figures/l1_cifar10-0.5.png}
%      \end{subfigure}
%     %  \hfill
%      \begin{subfigure}[b]{0.32\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{ICLR22/new_figures/l1_cifar10-1.0.png}
%      \end{subfigure}
%     % \hfill
%     \begin{subfigure}[b]{0.32\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{ICLR22/new_figures/l1_imagenet-0.25.png}
%      \end{subfigure}
%     %  \hfill
%      \begin{subfigure}[b]{0.32\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{ICLR22/new_figures/l1_imagenet-0.5.png}
%      \end{subfigure}
%     %  \hfill
%      \begin{subfigure}[b]{0.32\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{ICLR22/new_figures/l1_imagenet-1.0.png}
%      \end{subfigure}
%         \caption{
%         \textcolor{black}{ 
%         \textbf{$\ell_1$ Certified accuracy comparison against $\text{Yang}$ per radius per $\sigma$.} We compare $\text{Yang}$ against
%         % our data dependent certification
%         $\text{Yang-DS}$.
%         % and when data dependency is incorporated in both training and certification
%         % for several $\sigma$. 
%       We show CIFAR10 and ImageNet results in first and second rows, respectively. Similar to the earlier experiments on $\ell_2$ certificate, deploying data-dependent smoothing with the memory enhanced classifier yields significant improvement for the $\ell_1$ certified accuracy in all considered scenarios. }}
%         % , where the last column is the envelope.}
%         \label{fig:yang}
% \end{figure*}

\input{tables/l1}