Automatic Gradient Descent: Toward Deep Learning without Hyperparameters

TMLR Paper3486 Authors

13 Oct 2024 (modified: 25 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: Existing frameworks for deriving first-order optimisers tend to rely on generic smoothness models that are not specific to deep learning. As a prime example, the objective function is often assumed to be Lipschitz smooth. While these generic models do facilitate optimiser design, the resulting optimiser can be ill-suited to a given application without substantial tuning. In this paper, we develop a new framework for deriving optimisers that are directly tailored to the network architecture in question. Our framework leverages an architecture-specific smoothness model that depends explicitly on details such as the width and number of layers. Analytically minimising this smoothness model yields the automatic gradient descent optimiser, or AGD for short. AGD automatically sets its weight initialisation and step-size as a function of the network architecture. Our central empirical finding is that AGD successfully trains on various architectures and datasets without many of the standard optimisation hyperparameters: initialisation variance, learning rate, momentum or weight decay. A caveat to this result is that our theoretical framework works in the full batch setting, so that AGD still involves a batch size hyperparameter. Furthermore, the performance of AGD does not always match that of a highly tuned baseline. However, we suggest several directions for refining and extending our analytical framework that may resolve these issues in the future.
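To make the abstract's claim concrete, here is a minimal sketch of an optimiser whose weight initialisation and per-layer step size are pure functions of the architecture (layer widths and depth). The specific scaling rules below (orthogonal initialisation scaled by sqrt(d_out/d_in), Frobenius-normalised updates scaled by sqrt(d_out/d_in)/depth) are illustrative assumptions in the spirit of the abstract, not the algorithm derived in the paper, which obtains its step size by analytically minimising the architecture-specific smoothness model.

```python
import torch

def architecture_aware_init_(linear_layers):
    """Illustrative initialisation: orthogonal weights scaled by sqrt(d_out / d_in).

    This scaling is an assumption for demonstration purposes, not the paper's
    exact initialisation rule.
    """
    for layer in linear_layers:
        d_out, d_in = layer.weight.shape
        torch.nn.init.orthogonal_(layer.weight)
        with torch.no_grad():
            layer.weight *= (d_out / d_in) ** 0.5
        if layer.bias is not None:
            torch.nn.init.zeros_(layer.bias)

@torch.no_grad()
def architecture_aware_step(linear_layers):
    """Illustrative update: each layer takes a Frobenius-normalised step whose
    size depends only on the layer dimensions and the network depth, so no
    learning-rate hyperparameter appears."""
    depth = len(linear_layers)
    for layer in linear_layers:
        grad = layer.weight.grad
        if grad is None:
            continue
        d_out, d_in = layer.weight.shape
        # Architecture-dependent step size: width ratio divided by depth.
        step = (d_out / d_in) ** 0.5 / depth
        layer.weight -= step * grad / (grad.norm() + 1e-12)
        layer.weight.grad = None
```

A usage pattern would be to collect a model's `torch.nn.Linear` modules, call `architecture_aware_init_` once before training, and call `architecture_aware_step` after each full-batch backward pass; in AGD proper, the per-iteration step size is additionally adapted from a gradient summary derived in the paper's analysis rather than fixed as above.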
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ikko_Yamane1
Submission Number: 3486