\newpage
\section{Rebuttal}
\subsection{bfhi Score:5 Confidence:4}


\paragraph{Our response.} We thank the reviewer for the detailed feedback. We are glad that the reviewer found the idea novel and intriguing, with potentially high impact within a subfield of AI. We address the concerns below in the order in which they were raised.

\textbf{A memory-based classifier means that its performance will depend on the data it previously saw, and it also depends on the order data comes in.} In general, the reviewer is correct, and we discuss this openly in the Limitations section (Section E of the Appendix). However, memory-based certification is order dependent only if two differently predicted samples have overlapping certified regions, i.e. when either the middle or the rightmost case in Figure 2 occurs. That is, if certified regions of differently predicted samples never overlap, the memory-based approach is order invariant. A natural question is then: ``How likely is it that the certified regions of two differently predicted samples intersect (note that their $\ell_2$ radii are often on the order of 1.0)?'' We observe that this is very unlikely unless the accuracy of the classifier is poor. In all experiments conducted on the respective datasets, this scenario never occurred during memory-based certification. This implies that all certified accuracies reported in the paper are the same for any ordering of the test samples of the respective datasets.
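
To make the condition concrete, here is a minimal sketch, assuming each certified region is an open $\ell_2$ ball $\mathcal{B}(x_i, R_i)$ around a tested sample $x_i$ with prediction $\hat{y}_i$ (this notation is introduced here for illustration and may differ from the one used in the paper). Two such balls overlap exactly when
\[
\|x_i - x_j\|_2 < R_i + R_j ,
\]
so, under this assumption, the memory-based certification is order invariant whenever
\[
\|x_i - x_j\|_2 \geq R_i + R_j \qquad \text{for all pairs } (i,j) \text{ with } \hat{y}_i \neq \hat{y}_j .
\]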


\textbf{It seems to me it will become harder and harder to certify when more points are in memory. If there are infinite number of data coming in, what is the asymptotic performance of the proposed method? Would certified accuracy drop to 0 asymptotically? And how does the order of data affect its asymptotic performance?} The concern about testing on dense, ``infinite'' data is partially valid. While an intersection of certified regions does become more likely as more samples are tested, intersections between certified regions of \emph{differently predicted} samples do not necessarily become more frequent. Hence, new samples are not necessarily certified with progressively smaller regions; on the respective datasets, intersections between regions of differently classified samples are rare. This does, however, suggest an interesting direction. A stronger adversary is one with full access to the history of all previously tested samples in memory. Such an attacker could present a few malicious samples for testing that are stored in memory and that increase the chance that the certified regions of future samples intersect with those of differently predicted stored samples, causing the certified accuracy to drop asymptotically. While such an adversary could exist in principle, it is of limited practical relevance, since deployed classifiers are updated in practice and each new classifier is certified with a fresh memory. This is an interesting direction for future research, but defining such a class of adversaries is currently beyond the scope of our paper.

% Moreover, even in classical certification, attaining a certain certified accuracy performance on a dataset reflects the performance of such adversary

\textbf{Following the methodology in this paper, conceptually, we can also discard all the remembered certifications in memory, and start fresh with a new classifier each time. Essentially this means we are using one fresh classifier for each example. Is it possible to do so and what is the implication?} If one resets the memory and certifies every sample independently, each with its own certified radius, the best one can deduce is the following: for the certification to be sound when $x+\delta$ is presented to the classifier, the same $\sigma$ used for $x$ must also be used for $x+\delta$. If this condition cannot be enforced in deployment, we end up with a classifier whose certification is sound only in theory, since we have no procedure to implement a classifier whose $\sigma$ is locally constant. Thus, if the memory is reset before testing each of the $n$ new samples, we are either (1) providing a sound certification of $n$ tested samples for a theoretical classifier, or (2) providing $n$ sound certifications for $n$ different smooth classifiers (each with a different $\sigma$). The memory bridges this gap and yields a single practical classifier with a sound certification.
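
For reference, a sketch of the standard randomized smoothing certificate of Cohen et al., written in notation assumed here for illustration (it may differ from the notation of the paper). Let $g_\sigma(x) = \arg\max_c \, \mathbb{P}_{\epsilon \sim \mathcal{N}(0,\sigma^2 I)}\left(f(x+\epsilon) = c\right)$ be the smoothed classifier, let $\underline{p_A}$ be a lower bound on the probability of its top class at $x$, and let $\overline{p_B}$ be an upper bound on the probability of the runner-up class. Then
\[
g_\sigma(x+\delta) = g_\sigma(x) \quad \text{for all } \|\delta\|_2 < R(x,\sigma) = \frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B})\right),
\]
where $\Phi^{-1}$ is the inverse standard Gaussian CDF. The guarantee concerns $g_\sigma$ evaluated at both $x$ and $x+\delta$ with the same $\sigma$; it says nothing about a classifier that would pick a different $\sigma$ at $x+\delta$.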



% This is since the randomized smoothing certification suggests that given $x$ if for all samples $x+\delta$ in the certified region of $x$ denoted as $\mathcal{R}(x)$ is predicted with a constant $\sigma$ prediction does not change.


\textbf{Is it possible to achieve soundness without memory?} We are currently not aware of an approach that removes the dependence of the certification on previously tested samples. This is a potential future direction that we do not yet know how to approach.





% Note that as mentioned in Section 3.5, we found that this is the case in all of our experimental results where no instances with differently predicted labels share any overlap in their certified regions making our memory-enhanced classifier order invariant. 


\subsection{S8Ji Score:7 Confidence:3}
We thank the reviewer for the valuable and detailed comments. We are glad that our work was well-received by the reviewer.

\textbf{Regarding the dependence of the memory algorithm on the order.} The reviewer's observation is spot on: the memory algorithm is order dependent. We discuss this openly in the Limitations section (Section E of the Appendix). However, this drawback had no impact on any of our experimental results. As discussed in our response to Reviewer bfhi, memory-based certification is order dependent only if two differently predicted samples have overlapping certified regions, i.e. when either the middle or the rightmost case in Figure 2 occurs. That is, if certified regions of differently predicted samples never overlap, the memory-based approach is order invariant. We observe that such an intersection is very unlikely, and in all our experiments it never occurred. This implies that all certified accuracies reported in the paper are the same for any ordering of the test samples of the respective datasets.

\textbf{Regarding the constrained covariance matrix.} We agree with Reviewer S8Ji: anisotropic covariance matrices could increase the certified volume. However, for a fair comparison with earlier approaches such as SmoothADV and MACER, we restricted our analysis in this work to isotropic covariance matrices and leave the anisotropic counterparts for future work. Moreover, considering full covariance matrices requires (1) deriving the resulting certificate (e.g. an ellipsoid as opposed to an $\ell_2$ ball), and (2) defining a new way of comparing two anisotropic certified regions, through volumes or other metrics.
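
As an illustration of what a volume-based comparison could look like (a sketch under the assumption that the anisotropic certified region is an ellipsoid with semi-axes $a_1, \dots, a_d$; we have not derived such a certificate): in $\mathbb{R}^d$, an $\ell_2$ ball of radius $R$ and such an ellipsoid have volumes
\[
V_{\text{ball}} = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} R^d , \qquad V_{\text{ellipsoid}} = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} \prod_{i=1}^{d} a_i ,
\]
so one candidate scalar for comparing an ellipsoid against an isotropic radius $R$ is the geometric mean $\left(\prod_{i=1}^{d} a_i\right)^{1/d}$ of its semi-axes.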

\textbf{Regarding learning or predicting the variance.} This is an excellent idea that would alleviate the cost of optimizing for the smoothing parameters. However, we argue that learning or predicting the smoothing parameters is a very challenging problem that could potentially reduce the improvement obtained from data-dependent smoothing. We ran initial experiments in which a separate network is trained to predict $\sigma$, with the sole purpose of speeding up certification, since this requires only a forward pass through a network as opposed to solving the optimization problem. However, we did not find any gains; in fact, this approach underperforms classical baselines such as Cohen et al. We leave the investigation of this line of work for the future.


\subsection{MTkP Score:7 Confidence:3}
We thank the reviewer for the valuable and detailed comments. We are glad that our work was well-received by the reviewer.

\textbf{Regarding the runtime comparison.} We thank the reviewer for raising the question of a fair runtime comparison. We have included a detailed comparison of the running time in Appendix H, and we will move it to the main paper in the final version.


\begin{figure}
    \centering
    \includegraphics[width=0.5\textwidth]{UAI22/rebuttal_figures/cohen_comparison_0.25.pdf}
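    \caption{Rebuttal experiment on CIFAR10 for the Cohen et al. baseline at $\sigma=0.25$: histogram of the obtained $\sigma_x^*$ (orange) and of the $\sigma_x^*$ values at which the certified radius is improved (green).}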
    \label{fig:rebuttal_experiment}
\end{figure}


\textbf{Regarding the minor issues.} We thank the reviewer for suggesting this insightful experiment. As suggested, we conducted it for the Cohen et al. baseline at $\sigma=0.25$. In Figure~\ref{fig:rebuttal_experiment}, we plot the histogram of the obtained $\sigma_x^*$ for CIFAR10 in orange, and the histogram of the $\sigma_x^*$ values at which the certified radius is improved in green.
We found that the certified robustness improvements occur across the full range of $\sigma_x^*$, which shows the efficacy of our proposed data-dependent smoothing. We will include a more detailed version of this experiment in the final version.



\textbf{Regarding the proposed adjustment for the middle scenario in Fig 2.} The reviewer is correct in their observation and suggestion. We opted for the solution with the least computational overhead, which is to adjust the radius of the new sample as opposed to adjusting the radius of every intersected sample in memory. We would like to bring to the reviewer's attention that in practice, given that the certified regions around the tested points are generally small, scenarios 2 and 3 in Figure 2 never occurred on the tested datasets, i.e. there are no differently predicted samples with intersecting certified regions.
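
As a minimal sketch of adjusting only the radius of the new sample, assuming $\ell_2$-ball certified regions and memory entries $(x_i, R_i, \hat{y}_i)$ (hypothetical notation introduced here; the exact rule in the paper may differ): a new sample $x$ with prediction $\hat{y}$ and candidate radius $R$ could be assigned
\[
\tilde{R} = \min\left(R,\; \min_{i \,:\, \hat{y}_i \neq \hat{y}} \left(\|x - x_i\|_2 - R_i\right)\right),
\]
so that $\mathcal{B}(x, \tilde{R})$ does not overlap any stored region with a different prediction, while the stored radii $R_i$ are left untouched. If the inner quantity is non-positive, the new sample lies inside or on the boundary of a stored region, and shrinking alone cannot resolve the overlap; that case is not covered by this sketch.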

% However, better alternatives could exist that could potentially result in a better overall certified radius.
% There might exist many solutions for this case such as the one pointed out by the reviewer.


\textbf{Regarding $\hat x$ and $x'$.} Yes, this is indeed a typo: the optimization is over the adversary $x'$. We will correct this in a potential future version.

% We refer to $\hat x$ as the adversarial attack. $x'$ is the optimization variable in the objective:
% \[\max_{x'} -\log \mathbb {E}_{\epsilon \sim \mathcal N (0,\sigma^2I)} \left[f^y_{\theta}(x' + \epsilon)\right], \qquad \text{where}\quad \|x'-x\| \leq \zeta
% \]
% We will make this clearer in the final version.







\subsection{7p67 Score:7 Confidence:2}
We thank the reviewer for the valuable and detailed comments. We are glad that our work was very well-received with generally positive feedback by the reviewer.

\textbf{Regarding the presentation details.} We thank the reviewer for spotting the typos and for suggesting several ways to improve the presentation. We will address these in any potential camera-ready version.

% We have used the command eqref to refer to previous equations so that the reader could click on it to jump directly to the equation. We will fix this as per the suggestion and add paranthesis for better clarification.

% (2) We thank the reviewer for spotting this clerical error which will be fixed in the final version.

\textbf{Regarding predicting $\sigma^*$ for any $x$.} We thank the reviewer for sharing this interesting idea. We considered learning $\sigma_x^*$ through a network that is trained jointly with the classifier. However, due to the hardness of the problem, we were not able to attain robustness improvements over the Cohen et al. baseline with that approach. The loss function used is similar to Equation 2, but the optimization is over the parameters $\phi$ of an encoder network, and $\epsilon$ is sampled from the encoder network. This is similar in spirit to variational autoencoders (VAEs). The failure of this approach could be attributed to several factors, such as the poor performance of the classifier at early epochs and the complexity of the optimization. Nevertheless, we believe that such approaches, if successful, would have several advantages, such as alleviating the optimization cost.