Keywords: stochastic gradient descent, momentum, convergence
Abstract: Understanding the convergence of SGD-based optimization algorithms can help in tackling large-scale machine learning problems. To ensure last-iterate convergence of SGD and momentum-based SGD (mSGD),
existing studies usually require the step size $\epsilon_{n}$ to decay such that $\sum_{n=1}^{+\infty}\epsilon_{n}^{2}<+\infty$, a condition that is rather conservative and may lead to slow convergence in the early stage of the iteration. In this paper, we relax this requirement by studying an alternative step size for mSGD. Specifically, we weaken the decay requirement to $\sum_{n=1}^{+\infty}\epsilon_{n}^{2+\eta_{0}}<+\infty$ with $0\le\eta_{0}<1/2$. This allows a larger step size, such as $\epsilon_{n}=\frac{1}{\sqrt{n}}$, to be used to accelerate mSGD in the early stage. Under this new step-size condition and some common assumptions, we prove that the gradient norm of mSGD for non-convex loss functions asymptotically decays to zero. In addition, we show that this step size indeed speeds up convergence to a neighborhood of the stationary points in the early stage. Furthermore, we establish the convergence of mSGD under a constant step size $\epsilon_n\equiv\epsilon>0$, removing the strong-convexity requirement commonly imposed in the literature.
Experiments are presented to illustrate the developed results.
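As a rough illustration of the setting (not the authors' exact formulation or experiments), the sketch below runs heavy-ball momentum SGD on a toy non-convex objective with the larger step size $\epsilon_{n}=\frac{1}{\sqrt{n}}$ permitted by the relaxed condition; the momentum coefficient `beta`, the toy loss, the noise model, and the iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_grad(theta, noise_scale=0.1):
    # Gradient of a toy non-convex loss f(theta) = sum(theta**2 / (1 + theta**2))
    # (illustrative choice only), corrupted by additive Gaussian noise.
    g = 2 * theta / (1 + theta**2) ** 2
    return g + noise_scale * rng.standard_normal(theta.shape)

theta = rng.standard_normal(10)   # iterate
v = np.zeros_like(theta)          # momentum buffer
beta = 0.9                        # momentum coefficient (assumed value)

for n in range(1, 10_001):
    eps_n = 1.0 / np.sqrt(n)      # satisfies sum_n eps_n^(2+eta0) < +inf for any eta0 > 0
    v = beta * v - eps_n * stochastic_grad(theta)   # heavy-ball momentum update
    theta = theta + v

# Gradient norm at the last iterate (noise-free evaluation for inspection)
print(np.linalg.norm(stochastic_grad(theta, noise_scale=0.0)))
```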