\documentclass{article}

% Please use the following line and do not change the style file.
\usepackage{icml2021_author_response}

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{hyperref}       % hyperlinks
\usepackage{booktabs} % for professional tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.



\usepackage{lipsum}

\usepackage{wrapfig}

\usepackage{mylatexstyle}
\newcommand{\parallelsum}{\mathbin{\!/\mkern-5mu/\!}}
\def\CC{\textcolor{red}}

\begin{document}
% Uncomment the following line if you prefer a single-column format
\onecolumn

We thank all the reviewers for their helpful comments and address them below.

\noindent\textbf{Response to Reviewers \#5 and \#6:} \textcolor{blue}{On the technical novelty of the paper}\\
\textbf{A.} We would like to clarify that our analysis is not a simple extension of existing techniques. We highlight the novel techniques in our paper as follows: (i) We develop a new proof technique for the equivalence between the maximum margin classifier and the minimum norm interpolator. Specifically, we establish Lemma 4.3, which utilizes the polarization identity to obtain a sharp bound; (ii) We introduce a novel technique to derive the population risk bound. Specifically, Lemma 4.7 uses multiple matrix concentration inequalities to analyze anisotropic models. Thanks to these proof techniques, our results cover anisotropic settings and are tighter than existing ones even when specialized to the isotropic setting.

\noindent\textbf{Response to Reviewer \#3:}\\
\noindent \textbf{Q1.} \textcolor{blue}{``the principle issue with the setting studied here is that there is no noise in data, ....'', ``Do the results extend to the case where there is label noise? If not, is there a setting that is captured by the assumptions here where the Bayes error is exact zero when the number of samples n goes to infinity?''}\\
\textbf{A1.} This is a misunderstanding. Although our model differs from that of Chatterji \& Long (2020) in that we do not add label-flipping noise, our model is still noisy because of the nature of the sub-Gaussian mixture model. For example, consider a mixture of two Gaussian distributions. The two Gaussian clusters have non-trivial overlap, and the Bayes optimal classifier has non-zero Bayes risk. Therefore, the Bayes optimal classifier and the interpolating classifier are generally quite different.\\
In general, a model is appropriate for the study of benign overfitting whenever the optimal classifier has non-zero Bayes risk, like in our setting. 
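To make the Gaussian example concrete (a sketch under assumptions we state only for illustration: equal class weights, clusters $N(\bmu,\bSigma)$ and $N(-\bmu,\bSigma)$, and $\bSigma$ positive definite), the Bayes optimal classifier is $\mathbf{x} \mapsto \mathrm{sign}(\bmu^\top\bSigma^{-1}\mathbf{x})$, and its risk is
\[
\Phi\Big(-\sqrt{\bmu^\top\bSigma^{-1}\bmu}\Big) > 0,
\]
where $\Phi$ is the standard normal distribution function. This risk is strictly positive for every finite $\bmu$, regardless of the sample size $n$.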
% This is a misunderstanding. In fact, although our model is not exactly the same as Chatterji \& Long (2020) because we don’t have additional label flipping noise, there is still noise in our model because of the nature of sub-Gaussian mixture model. Thus the error of the Bayes optimal classifier cannot be zero in our model. For example, we can consider the mixture of two Gaussian distributions. The two Gaussian clusters have `overlap', and the Bayes optimal classifier cannot perfectly classify all the training data. Therefore, the Bayes optimal classifier and the overfitting classifier are very different, and 
% We can treat the training data points that are misclassified by the Bayes optimal classifier, but are correctly classified by the overfitting classifier as `noisy data'. This indicates that our setting is a valid and reasonable setting to study the benign overfitting phenomenon. %\CC{We believe most of the results in our paper can also be extended to the setting with additional label flipping noises, and we will comment it in the revision.}
\\
\noindent \textbf{Q2.} \textcolor{blue}{On the claim that Chatterji and Long'20 do not handle the anisotropic case because they require that $\tr(\bSigma) = \Omega(d)$}\\
\textbf{A2.} Thank you for pointing this out. You are correct. We will revise our remark to make it accurate.

\noindent\textbf{Response to Reviewer \#5:}\\
\noindent \textbf{Q1.} \textcolor{blue}{``Major concern: The work of Chatterji \& Long (2020) study a different generative data process. In particular, they assume that $\ub \in \RR^d$ is sampled from an *arbitrary* product distribution where the marginals are sub-Gaussian. In contrast, the data generation in this paper assumes that the entries of u are independent''}\\
\textbf{A1.} This is a misunderstanding. The assumption in Chatterji \& Long (2020) that $\ub \in \RR^d$ is sampled from an arbitrary product distribution is equivalent to assuming that the entries of $\ub$ are independent. Therefore, our assumption is no stronger than the assumption in Chatterji \& Long (2020). Note that the proofs in Chatterji \& Long (2020) rely on this independence. For example, the proofs of Lemma 4.3 and Lemma A.6 in Chatterji \& Long (2020) use Hoeffding's inequality, which requires that the entries of $\ub$ are independent.\\
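In symbols, writing $\ub = (u_1,\dots,u_d)$ and $P_j$ for the marginal law of $u_j$ (notation introduced here only for illustration), $\ub$ follows the product distribution $P_1 \otimes \cdots \otimes P_d$ exactly when
\[
\mathbb{P}(u_1 \in A_1, \dots, u_d \in A_d) = \prod_{j=1}^d \mathbb{P}(u_j \in A_j) \quad \mbox{for all measurable } A_1,\dots,A_d \subseteq \RR,
\]
which is the definition of independence of $u_1,\dots,u_d$.\\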
\noindent \textbf{Q2.} \textcolor{blue}{Discussion on the conditions. ``... about the decaying eigenvalues of the covariance matrix. why it is a mild condition.''}\\
\textbf{A2.} We apologize that the condition was not discussed in enough detail. Here we provide a more detailed explanation. In Corollary 3.5, consider the example where the sample size $n$ is a constant. Then for the isotropic setting (where $\alpha = 0$), we need $\| \bmu \|_2 = \omega(d^{1/4})$ to achieve a small population risk. In comparison, for certain anisotropic settings with $\alpha \in (1/2,1)$, we only need $\| \bmu \|_2 = \omega(1)$ to achieve a small population risk. Therefore, the condition for the anisotropic example is milder than the condition for the isotropic setting. We will make the discussion clearer in the revision.

\noindent\textbf{Response to Reviewer \#7:}\\
\noindent \textbf{Q1.} \textcolor{blue}{``the mean vector $\mu$ is sometimes not in boldface, please double-check this.''}\\
\textbf{A1.} Thank you for pointing it out. We will fix the typos in the revision.\\
\noindent \textbf{Q2.} \textcolor{blue}{``I would like to see more discussions on the interplay between the mean vector and the covariance eigen-decay''}\\
\textbf{A2.} 
Thank you for the suggestion. We will add a 
corollary discussing the impact of the direction of $\bmu$ based on how it 
\begin{wrapfigure}{r}{10cm}
	\begin{center}
% 	\vspace{-0.25in}
% 		\begin{tabular}{cc}
 		\vspace{-0.25in}
			\subfigure[$\bmu\perp\vb_1$]{\includegraphics[height=0.8in,angle=0]{./mu_alignment_v2.pdf}\label{subfig:v2}}\qquad
			\subfigure[random direction]{\includegraphics[height=0.8in,angle=0]{./mu_alignment_mix.pdf}\label{subfig:mix}}\qquad
			\subfigure[$\bmu\perp\vb_2$]{\includegraphics[height=0.8in,angle=0]{./mu_alignment_v1.pdf}\label{subfig:v1}}
		  %  \end{tabular}
	\end{center}
	\vspace{-12pt}
	\caption{A $2$-dimensional illustration of sub-Gaussian mixture classification problems with different directions of $\bmu$. We consider the setting where $\bSigma \in \RR^{2\times 2}$ has two eigenvalues $\lambda_1 > \lambda_2$ with the corresponding eigenvectors $\vb_1,\vb_2$.}
	\label{fig:mualignment}
	\vspace{-.15in}
\end{wrapfigure}
aligns with the eigenvectors of $\bSigma$. The corollary will show that the population risk is smaller when $\bmu$ aligns with the eigendirection corresponding to a smaller eigenvalue of $\bSigma$. Consider an example where $\bSigma$ has eigenvalues $\lambda_1 > \lambda_2$ with the corresponding eigenvectors $\vb_1,\vb_2$. The geometric intuition is illustrated in Figure~\ref{fig:mualignment}, which shows three settings: (a) $\bmu$ aligns with $\vb_2$; (b) $\bmu$ points in a random direction; (c) $\bmu$ aligns with $\vb_1$. The illustration suggests that (a) is the easiest case for classification and (c) is the hardest, which matches our theoretical analysis. We will add this figure and a more extensive discussion in the final version.
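As a heuristic check in a special case (our assumption for illustration, not the general sub-Gaussian model of the paper), suppose the two clusters are $N(\bmu,\bSigma)$ and $N(-\bmu,\bSigma)$ with equal weights and $\lambda_2 > 0$. With $\Phi$ denoting the standard normal distribution function, the Bayes risk is
\[
\Phi\Big(-\sqrt{\bmu^\top \bSigma^{-1} \bmu}\Big), \qquad \mbox{which equals } \Phi\big(-\|\bmu\|_2/\sqrt{\lambda_i}\big) \mbox{ when } \bmu \mbox{ is parallel to } \vb_i .
\]
Since $\lambda_1 > \lambda_2$, for a fixed $\|\bmu\|_2$ this quantity is smallest when $\bmu$ aligns with $\vb_2$ and largest when it aligns with $\vb_1$, consistent with the ordering of cases (a) and (c). This computation concerns the Bayes risk only and is not a substitute for the bound on the interpolating classifier.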

% This phenomenon perfectly matches the geometric intuition of sub-Gaussian mixture classifications, as
\end{document}
