Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic

Matteo Sordello; Niccolo Dalmasso; Hangfeng He; Weijie J Su

Published: 17 Feb 2024, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. The method decreases the learning rate, to better adapt to the local geometry of the objective function, whenever a stationary phase is detected, that is, whenever the iterates are likely bouncing around in the vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost compared with standard SGD. Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance that compares favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam.
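The splitting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy objective (a noisy quadratic), the function names, the window sizes, the threshold `q`, the decay factor, and the rule of averaging the two threads after the diagnostic are all assumptions chosen for illustration. The sketch only shows the general mechanism: run two SGD threads from the same point, compare their window-averaged gradients through an inner product, and lower the learning rate when enough of those inner products are negative.

```python
import random


def noisy_grad(theta, rng, noise=1.0):
    # Stochastic gradient of the toy objective f(theta) = 0.5 * ||theta||^2
    # (an assumed test problem), with additive Gaussian noise.
    return [t + noise * rng.gauss(0.0, 1.0) for t in theta]


def sgd_thread(theta, lr, steps, rng):
    # Run `steps` plain SGD updates; return the final iterate and the
    # average of the stochastic gradients seen along the way.
    theta = list(theta)
    g_sum = [0.0] * len(theta)
    for _ in range(steps):
        g = noisy_grad(theta, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        g_sum = [s + gi for s, gi in zip(g_sum, g)]
    return theta, [s / steps for s in g_sum]


def looks_stationary(theta, lr, rng, windows=4, window_len=50, q=0.4):
    # Hypothetical diagnostic: split into two threads from the same start,
    # and count windows whose averaged gradients have a negative inner product.
    a, b = list(theta), list(theta)
    negatives = 0
    for _ in range(windows):
        a, ga = sgd_thread(a, lr, window_len, rng)
        b, gb = sgd_thread(b, lr, window_len, rng)
        if sum(x * y for x, y in zip(ga, gb)) < 0:
            negatives += 1
    # Assumption: continue from the average of the two threads.
    merged = [(x + y) / 2 for x, y in zip(a, b)]
    return negatives / windows >= q, merged


def split_sgd_sketch(theta0, lr=0.5, rounds=6, steps_per_round=200,
                     decay=0.5, seed=0):
    # Alternate a single-thread SGD phase with the splitting diagnostic,
    # shrinking the learning rate whenever stationarity is flagged.
    rng = random.Random(seed)
    theta = list(theta0)
    lrs = [lr]
    for _ in range(rounds):
        theta, _ = sgd_thread(theta, lr, steps_per_round, rng)
        stationary, theta = looks_stationary(theta, lr, rng)
        if stationary:
            lr *= decay
        lrs.append(lr)
    return theta, lrs
```

Calling `split_sgd_sketch([1.0] * 5)` returns the final iterate and the learning rate recorded after each round. Far from the minimum, the two threads' averaged gradients point in roughly the same direction, so the inner products are positive and the rate is kept. Near the minimum, noise dominates, the signs become mixed, and the rate is reduced.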

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission:
• Added a subsection in Section 4.2 and a sub-figure in Figure 6 for the CIFAR-100 experiments (initially included in our response to reviewer dxDg).
• Added a paragraph in Section 4.3 and a sub-figure in Figure 8 for the sensitivity analysis of ResNet18 on CIFAR-10 (initially included in our response to reviewer ocBo).
• Improved the figure captions by clarifying them and including figure takeaways (as pointed out by reviewers ocBo and 5Pef).
• Added batch size information for all experiments (as pointed out by reviewer 5Pef).
• Added our comments on the theoretical assumptions, and replaced "efficient" with "novel" in the conclusion (as discussed with reviewer ocBo).
• Proofread the paper carefully again and fixed grammatical errors and other typos.

Supplementary Material: zip

Assigned Action Editor: ~Arnaud_Doucet2

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 1203
