\section{Related work}

SGD and its adaptive variants have been the subject of intense interest over the last decade. We refer readers to \citep{bottou2018optimization,ruder2016overview} for an overview, and limit our discussion below to the most relevant literature.

\paragraph{Adaptive methods for non-convex optimization} Numerous works have studied convergence guarantees of adaptive methods in expectation. \citet{li2019convergence} provided a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for AdaGrad-Norm with a delayed step-size that is independent of the current stochastic gradient, but required knowledge of the smoothness parameter. Meanwhile, \citet{ward2020adagrad} obtained a rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for AdaGrad-Norm without prior knowledge of problem parameters, but assumed uniformly bounded gradients. \citet{faw2022power} studied AdaGrad-Norm under affine variance noise and achieved the desired convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ without assuming uniformly bounded gradients. More recently, \citet{alina2023on} established a convergence bound for AdaGrad-Norm and its accelerated version on quasar-convex smooth problems. The element-wise version of AdaGrad was studied by \citet{zou2019sufficient} with heavy-ball or Nesterov-style momentum. \citet{zaheer2018adaptive} and \citet{chen2018convergence} also proved convergence bounds for Adam-type algorithms. Later, \citet{defossez2020simple} improved the convergence rate for Adam and AdaGrad with a better dependency on the momentum parameter $\beta$, specifically of order $\mathcal{O}((1-\beta)^{-1})$. Another recent work by \citet{guo2021novel} obtained convergence rates that improve as the momentum parameter increases, but again required knowledge of the smoothness parameter. In parallel, \citet{rakhlinconvergence} studied the convergence behavior of Adam on non-convex generalized smooth problems under the bounded noise variance assumption. More recently, \citet{zhou2023win} introduced Nesterov-like acceleration into Adam and AdamW and justified its convergence superiority over the non-accelerated versions.

\paragraph{High probability convergence bounds} \citet{ghadimi2013stochastic} provided a high probability convergence result for vanilla SGD under sub-Gaussian noise, assuming knowledge of the smoothness parameter. Later, \citet{harvey2019tight} established a high probability bound for Projected SGD with bounded noise on strongly-convex non-smooth Lipschitz functions. Several works have also established high probability bounds for adaptive methods. \citet{zhou2018convergence} and \citet{ward2020adagrad} provided high probability bounds for AMSGrad and AdaGrad with sub-Gaussian noise and bounded noise, respectively, although neither of these bounds is optimal in its dependence on the probability margin. \citet{li2020high} considered AdaGrad with a delayed step-size and obtained a high probability bound that requires knowledge of the smoothness parameter. More recently, \citet{kavis2022high} obtained a convergence bound for AdaGrad-Norm that is optimal in the probability margin, assuming uniformly bounded gradients and sub-Gaussian noise. Additionally, \citet{attia2023sgd} relaxed the uniformly bounded gradient condition and obtained a bound adaptive to the noise level for AdaGrad-Norm under affine variance noise. In parallel, \citet{alina2023high} also established an optimal high probability bound for both the scalar and element-wise versions of AdaGrad with sub-Gaussian noise. Recently, further high probability convergence results have emerged for Clipped SGD in both convex and smooth non-convex settings under heavy-tailed noise assumptions \citep{cutkosky2021high,sadiev2023high,nguyen2023high}.

