
%\abcomment{Need to have a more detailed discussion of the related work}


The literature on gradient descent and its variants \pcedit{for deep learning optimization typically uses the NTK condition at initialization, or variations of it, as an integral part of the analysis.} \pcedit{The idea is that if the minimum eigenvalue of the NTK is bounded away from zero at initialization, then, under suitable conditions, this property can be shown to persist during training (in a local neighborhood around initialization).} \pcedit{The literature on deep learning optimization} is large and growing, and we refer the reader to the following surveys for an overview of the field~\citep{JF-CM-YZ:21, PB-AM-AR:21}. \pcedit{For example, among the theoretical works analyzing multi-layer neural networks, we refer to}~\citep{SD-JL-HL-LW-XZ:19,ZAZ-YL-ZS:19,DZ-QG:19,DZ-YC-DZ-QG:20,CL-LZ-MB:21,AB-PCV-LZ-MB:22}. For a literature review on shallow and/or linear networks, we refer to the recent survey~\citep{CF-HD-TZ:21}. Due to the rapidly growing literature, we only mention the most closely related or recent works. %for most parts.  

\pcedit{The works}~\citep{DZ-QG:19,DZ-YC-DZ-QG:20,ZAZ-YL-ZS:19,ng2020hermite1,ng2021opt,ng2021hermite2} analyze deep ReLU networks, whereas~\citep{SD-JL-HL-LW-XZ:19,CL-LZ-MB:21} consider smooth activation functions. The convergence analysis of gradient descent in~\citep{SD-JL-HL-LW-XZ:19,ZAZ-YL-ZS:19,DZ-QG:19,DZ-YC-DZ-QG:20,CL-LZ-MB:21} relies on the near constancy of the NTK for wide neural networks~\citep{AJ-FG-CH:18,JL-LX-SS-YB-RN-JSD-JP:19,SA-SSD-WH-ZL-RS-RW:19,CL-LZ-MB:20}, which yields certain desirable properties for training with gradient-descent-based methods. %\pccomment{Is this true though? It seems to need $\Omega(\max\{n^4,n\log(n)\})$ for smooth activations; see the paper.} 
One such property is related to the PL condition~\citep{karimi2016linear,ng2021opt}, formulated as the PL$^*$ condition in~\citep{CL-LZ-MB:21}. \pcedit{For models with $L=O(1)$ layers, existing results need $m =\Omega(n^4)$ for smooth activations~\citep{SD-JL-HL-LW-XZ:19} to ensure convergence.} \pcedit{The recent work~\citep{AB-PCV-LZ-MB:22}} uses a different optimization analysis based on the restricted strong convexity (RSC) condition, \pcedit{which they relate to a restricted version of the PL condition, and they compare this condition to the widely used NTK one. Finally, shallow ReLU networks (with only one hidden layer) were studied by~\cite{ZJ-MT:19}, who showed that a data separability assumption, together with $m$ depending poly-logarithmically on $n$, allows gradient descent to achieve training and testing guarantees. Interestingly, data separability assumptions can be incorporated into our results to establish further lower bounds on the minimum eigenvalue of the NTK; see Remark~\ref{rem:lambda1}.%The geometric convergence conditions in~\citep{AB-PCV-LZ-MB:22} are strengthened when the upper bound on the Hessian of the neural network is made smaller.
}
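
To make the link between the NTK condition and PL-type conditions concrete, the following sketch may help. The notation is an assumption made for illustration only and need not match the rest of the paper: $f(\theta;x_i)$ denotes the scalar network output on the $i$-th of $n$ training samples, $y_i$ the corresponding label, $J(\theta)$ the $n \times p$ Jacobian of the outputs with respect to the $p$ parameters $\theta$, $K(\theta) = J(\theta)J(\theta)^\top$ the NTK Gram matrix, and $\mathcal{L}$ the (unnormalized) square loss with residual vector $r(\theta)$.
\[
\mathcal{L}(\theta) = \frac{1}{2}\sum_{i=1}^{n}\bigl(f(\theta;x_i)-y_i\bigr)^2 = \frac{1}{2}\|r(\theta)\|_2^2,
\qquad
r(\theta) = \bigl(f(\theta;x_i)-y_i\bigr)_{i=1}^{n},
\]
\[
\nabla \mathcal{L}(\theta) = J(\theta)^\top r(\theta),
\]
\[
\|\nabla \mathcal{L}(\theta)\|_2^2
= r(\theta)^\top K(\theta)\, r(\theta)
\geq \lambda_{\min}\bigl(K(\theta)\bigr)\,\|r(\theta)\|_2^2
= 2\,\lambda_{\min}\bigl(K(\theta)\bigr)\,\mathcal{L}(\theta).
\]
Under these assumptions, a bound $\lambda_{\min}(K(\theta)) \geq \lambda_0 > 0$ on a neighborhood of the initialization gives the PL-type inequality $\frac{1}{2}\|\nabla \mathcal{L}(\theta)\|_2^2 \geq \lambda_0\, \mathcal{L}(\theta)$ on that neighborhood; a different normalization of the loss would only change the constants.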

We also remark that the smallest eigenvalue of the NTK at initialization plays a crucial role in both the fitting capacity and the generalization behavior of neural networks~\citep{SA-SD-WH-ZL-RW:19,AM-YZ:20,CL-LZ-MB:20,ng2021opt,ng2021hermite2,oymak2020hermite}.
%
%As mentioned earlier, the positiveness of the smallest eigenvalue of the NTK at initialization also plays an important role on fitting capacity and generalization behavior of neural networks. \pcedit{Again, we remark that for the case of ReLU activation, existing results need $m = \widetilde{\Omega}(n)$ to satisfy this condition~\citep{ng2021hermite2}, and for the smooth activation case, the same width is needed under specific conditions on the data~\citep{bombari2022memorization}. 
%We remark that until recently, for models with constant depth, $m = \widetilde{\Omega}(n^2)$ was needed for NTK at initialization to be positive definite for smooth activations~\citep{SD-JL-HL-LW-XZ:19}. \lzcomment{put the last paragraph to the first? since it's most related}
%}