Abstract: This article studies how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., heavy-ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, many advanced momentum variants, despite their empirical advantage over classical SHB/NAG, introduce extra hyperparameters to tune, and this error-prone tuning is a major barrier for AutoML. Centralized SGD: We first focus on centralized single-machine SGD and show how to efficiently schedule the hyperparameters of a large class of momentum variants to improve generalization. We propose a unified framework called multistage quasi-hyperbolic momentum (multistage QHM), which covers a large family of momentum variants as special cases (e.g., vanilla SGD, SHB, and NAG). Existing works mainly focus on scheduling only the decay of the learning rate α, whereas multistage QHM additionally allows other hyperparameters (e.g., the momentum factor) to vary and demonstrates better generalization than tuning α alone. We prove the convergence of multistage QHM for general non-convex objectives. Distributed SGD: We then extend our theory to distributed asynchronous SGD (ASGD), in which a parameter server distributes data batches to several worker machines and updates parameters by aggregating batch gradients from the workers. We quantify the asynchrony between workers (i.e., gradient staleness), model the dynamics of asynchronous iterations via a stochastic differential equation (SDE), and then derive a PAC-Bayesian generalization bound for ASGD. As a byproduct, we show how a moderately large learning rate helps ASGD generalize better. Our tuning strategies have rigorous justifications rather than relying on blind trial and error: we theoretically prove why they decrease the derived generalization errors in both cases. Empirically, our strategies simplify the tuning process and beat competitive optimizers in test accuracy. Our code is publicly available at https://github.com/jsycsjh/centralized-asynchronous-tuning.
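As a rough illustration of the update rule underlying this framework, the sketch below implements a single quasi-hyperbolic momentum (QHM) step in plain NumPy-style Python and notes which hyperparameter settings recover vanilla SGD, SHB, and a NAG-like update. The function name `qhm_step`, its arguments, and the stagewise-scheduling comment are illustrative assumptions; this is not the paper's exact multistage algorithm or its convergence conditions.

```python
import numpy as np

def qhm_step(theta, g_buf, grad, alpha, beta, nu):
    """One quasi-hyperbolic momentum (QHM) update (hypothetical sketch).

    theta : current parameters (np.ndarray)
    g_buf : exponential moving average of past gradients (momentum buffer)
    grad  : stochastic gradient evaluated at theta
    alpha : learning rate, beta : momentum factor, nu : interpolation weight
    """
    # Update the momentum buffer as an exponential moving average of gradients.
    g_buf = beta * g_buf + (1.0 - beta) * grad
    # QHM interpolates between the raw gradient and the momentum buffer.
    theta = theta - alpha * ((1.0 - nu) * grad + nu * g_buf)
    return theta, g_buf

# Special cases recovered by fixing (beta, nu):
#   nu = 0    -> vanilla SGD (the momentum buffer is ignored)
#   nu = 1    -> (normalized) heavy-ball momentum, SHB
#   nu = beta -> a NAG-like update
# A multistage schedule would change (alpha, beta, nu) at chosen stage
# boundaries, e.g. shrinking alpha while adjusting nu between stages,
# rather than decaying alpha alone.
```

A small usage sketch: initialize `g_buf = np.zeros_like(theta)` and call `qhm_step` once per mini-batch, switching the `(alpha, beta, nu)` triple at stage boundaries of the training run.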