Keywords: Gradient Descent, Adaptive Step Size, Adaptive Learning Rate
Abstract: Selecting an appropriate learning rate for efficiently training deep neural networks is a difficult process that depends on numerous factors, such as the dataset, the model architecture, or even the batch size. In this work, we propose an algorithm for automatically adjusting the learning rate during gradient descent. The rationale behind our approach is to train the learning rate along with the model weights, akin to line search. Contrary to existing approaches, the learning rate is optimized via a simple extra gradient descent step, justified by an analysis that takes into consideration the structure of a neural network loss function. We formulate the first- and second-order gradients with respect to the learning rate as functions of consecutive weight gradients, leading to a cost-effective implementation. We also show that the scheme can be extended to accommodate different learning rates per layer. Extensive experimental evaluation is conducted, validating the effectiveness of the proposed method across a wide range of settings. The proposed method has proven to be robust to both the initial learning rate and the batch size, making it well-suited as an off-the-shelf optimization scheme.
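The abstract does not spell out the exact update rule, but the idea of taking a gradient step on the learning rate using consecutive weight gradients can be illustrated with a minimal hypergradient-style sketch. The assumption here is that the derivative of the loss with respect to the previously used learning rate is approximated by the negative dot product of consecutive gradients; the function and parameter names (`train`, `meta_lr`, the toy quadratic loss) are illustrative, not the authors' implementation.

```python
import numpy as np

def quadratic_loss_grad(w):
    """Gradient of a toy quadratic loss f(w) = 0.5 * ||w||^2."""
    return w

def train(w, grad_fn, lr=0.01, meta_lr=1e-4, steps=100):
    """Plain gradient descent where the learning rate is also trained.

    Sketch assumption: dL/d(lr) at step t is approximated by
    -grad(w_t) . grad(w_{t-1}), i.e. a function of consecutive
    weight gradients, so descending it costs only one dot product.
    """
    prev_grad = None
    for _ in range(steps):
        g = grad_fn(w)
        if prev_grad is not None:
            # Gradient step on the learning rate itself: lr grows when
            # consecutive gradients align, shrinks when they oppose.
            lr += meta_lr * np.dot(g, prev_grad)
        w = w - lr * g  # ordinary gradient-descent step on the weights
        prev_grad = g
    return w, lr

w_final, lr_final = train(np.ones(10), quadratic_loss_grad)
print(w_final, lr_final)
```

On the toy quadratic, the learned learning rate drifts toward a value that accelerates convergence without manual tuning, which is the behavior the abstract claims for the full method.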
One-sentence Summary: An extension of classic gradient descent in which, instead of using a schedule for the learning rate, we treat it as a model parameter and learn it from the data.
Supplementary Material: zip