Nonmonotone Line Searches Operate at the Edge of Stability

Published: 10 Oct 2024, Last Modified: 07 Dec 2024 · NeurIPS 2024 Workshop · CC BY 4.0
Keywords: Edge of Stability, Gradient Descent, Line Search, Large Scale Optimization, First Order Optimization
TL;DR: We show that nonmonotone line searches operate at the edge of stability.
Abstract: The traditional convergence analysis of Gradient Descent (GD) assumes the step size to be bounded from above by twice the reciprocal of the sharpness, i.e., the largest eigenvalue of the Hessian of the objective function. However, recent numerical observations on neural networks have shown that GD also converges with larger step sizes. In this case, GD may enter the so-called edge of stability phase, in which the objective function decreases faster than with smaller steps, but nonmonotonically. Interestingly, the same behavior was already observed when using nonmonotone line searches. These methods are designed to accept larger steps than their monotone counterparts (e.g., Armijo), as they do not impose a decrease in the objective function at every iteration, while still being provably convergent. In this paper, we show that nonmonotone line searches operate in the edge of stability regime right from the start of training. Moreover, we design a new resetting technique that speeds up training and yields flatter solutions by keeping GD at the edge of stability, without requiring hyperparameter tuning or prior knowledge of problem-dependent constants. Finally, we observe that the large steps yielded by our method seem to mimic the behavior of the well-known Barzilai-Borwein method.
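To make the mechanism concrete, below is a minimal NumPy sketch of a max-window (Grippo-style) nonmonotone Armijo line search wrapped around plain gradient descent. The window size M, the constants, and the toy quadratic are illustrative assumptions, not the paper's exact method (which additionally includes the resetting technique).

```python
import numpy as np

def nonmonotone_gd(f, grad, x0, alpha0=1.0, M=10, c=1e-4, shrink=0.5, max_iter=200):
    """Gradient descent with a max-window nonmonotone Armijo line search:
    a step is accepted if it yields sufficient decrease relative to the
    maximum of the last M objective values, not the current value."""
    x = x0.copy()
    history = [f(x)]                  # recent objective values for the reference max
    for _ in range(max_iter):
        g = grad(x)
        alpha = alpha0
        f_ref = max(history[-M:])     # nonmonotone reference value
        # Backtrack until the relaxed sufficient-decrease condition holds.
        while f(x - alpha * g) > f_ref - c * alpha * np.dot(g, g):
            alpha *= shrink
        x = x - alpha * g
        history.append(f(x))
    return x

# Toy usage on a convex quadratic f(x) = 0.5 x^T A x with sharpness 100,
# so the classical stability bound would require step sizes below 2/100.
A = np.diag([1.0, 10.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(nonmonotone_gd(f, grad, x0=np.ones(3)))
```

Because the acceptance test compares against the window maximum rather than the current objective value, the search can accept step sizes above the classical 2/sharpness threshold, allowing the objective to increase temporarily, which is the nonmonotone, edge-of-stability-like behavior described in the abstract.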
Submission Number: 16