Corner Gradient Descent

Published: 09 Mar 2025, Last Modified: 09 Mar 2025 · MathAI 2025 Oral · CC BY 4.0
Keywords: mini-batch stochastic gradient descent, sampling noise, convergence rates, acceleration, power laws, phase diagram, Riemann mapping theorem, contour integration, rational approximations, MNIST
TL;DR: Generalized GD algorithms = contours in $\mathbb C$. Contours with a corner of angle $\theta \pi$ accelerate rates $O(t^{-\zeta})$ to $O(t^{-\theta\zeta})$. In SGD, there is an optimal $\theta\in (1,2)$ balancing acceleration with sampling noise.
Abstract: We consider SGD-type optimization on infinite-dimensional quadratic problems with power law spectral conditions. It is well-known that on such problems deterministic GD has loss convergence rates $L_t=O(t^{-\zeta})$, which can be improved to $L_t=O(t^{-2\zeta})$ by using Heavy Ball with a non-stationary Jacobi-based schedule (and the latter rate is optimal among fixed schedules). However, in the mini-batch Stochastic GD setting, the sampling noise causes the Jacobi HB to diverge; accordingly, no $O(t^{-2\zeta})$ algorithm is known. In this paper we show that rates up to $O(t^{-2\zeta})$ can be achieved by a generalized stationary SGD with infinite memory. We start by identifying generalized (S)GD algorithms with contours in the complex plane. We then show that contours that have a corner with external angle $\theta\pi$ accelerate the plain GD rate $O(t^{-\zeta})$ to $O(t^{-\theta\zeta})$. For deterministic GD, increasing $\theta$ allows us to achieve rates arbitrarily close to $O(t^{-2\zeta})$. However, in Stochastic GD, increasing $\theta$ also amplifies the sampling noise, so in general $\theta$ needs to be optimized by balancing the acceleration and noise effects. We prove that the optimal rate is achieved at $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$, where $\nu,\zeta$ are the exponents appearing in the capacity and source spectral conditions. Furthermore, using fast rational approximations of the power functions, we show that ideal corner algorithms can be efficiently approximated by finite-memory algorithms, and we demonstrate their practical efficiency on a synthetic problem and on MNIST.
Submission Number: 4
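
As a quick illustration of the abstract's headline formula, the sketch below (not from the paper's code; the exponent pairs $(\zeta,\nu)$ are hypothetical example values) evaluates $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$ and compares the resulting loss-rate exponents: plain GD at $O(t^{-\zeta})$, corner SGD at $O(t^{-\theta_{\max}\zeta})$, and the deterministic ceiling $O(t^{-2\zeta})$.

```python
# Illustrative sketch only: evaluates the abstract's formula
# theta_max = min(2, nu, 2 / (zeta + 1/nu)) for a few hypothetical
# spectral exponents (zeta, nu) and prints the implied rate exponents.

def theta_max(zeta: float, nu: float) -> float:
    """Optimal corner angle factor, per the abstract's formula."""
    return min(2.0, nu, 2.0 / (zeta + 1.0 / nu))

for zeta, nu in [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]:  # example values only
    th = theta_max(zeta, nu)
    print(f"zeta={zeta}, nu={nu}: theta_max={th:.3f}; "
          f"plain GD O(t^-{zeta}), corner SGD O(t^-{th * zeta:.3f}), "
          f"deterministic ceiling O(t^-{2 * zeta})")
```

Since $\theta_{\max}\le 2$ by construction, the stochastic corner rate can at best match the deterministic $O(t^{-2\zeta})$ ceiling, consistent with the abstract's claim that rates *up to* $O(t^{-2\zeta})$ are achievable.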