Keywords: Optimization, Learning rate, LLM
TL;DR: Static, initialization-based learning rates as a simple yet effective method to improve neural network training
Abstract: A major characteristic of the Adam optimizer is its adaptive step size modification, which prevents large gradients from dominating the update step size. Given that the simplicity and computational efficiency of first-order methods are a significant advantage for large-scale training, we investigate an extreme form of step size modification that assigns static, layer-wise learning rates inversely proportional to the initial gradient magnitudes. We observe that this simple heuristic is surprisingly effective at improving the rate of convergence on LLM-style models relative to eight contemporary optimizers, suggesting the possibility of a static, initialization-based preconditioner.
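The abstract's heuristic can be illustrated with a minimal sketch: measure each layer's gradient norm once at initialization, fix per-layer learning rates inversely proportional to those norms, then train with plain SGD. The exact scaling rule, model, and data below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear model on synthetic regression data (illustration only).
X = rng.normal(size=(64, 8))
y = rng.normal(size=(64, 1))
W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(16, 1))

def loss(W1, W2):
    return 0.5 * np.mean((X @ W1 @ W2 - y) ** 2)

def grads(W1, W2):
    h = X @ W1                        # hidden activations
    err = h @ W2 - y                  # residual
    g2 = h.T @ err / len(X)           # dL/dW2 for 0.5 * MSE
    g1 = X.T @ (err @ W2.T) / len(X)  # dL/dW1
    return g1, g2

# Static, layer-wise learning rates fixed once from the initial gradient norms:
# lr_l = base_lr / ||g_l||  (hypothetical form; the paper's scaling may differ).
base_lr = 0.05
g1, g2 = grads(W1, W2)
lr1 = base_lr / (np.linalg.norm(g1) + 1e-12)
lr2 = base_lr / (np.linalg.norm(g2) + 1e-12)

loss_init = loss(W1, W2)
for _ in range(100):                  # plain SGD with the fixed per-layer rates
    g1, g2 = grads(W1, W2)
    W1 -= lr1 * g1
    W2 -= lr2 * g2
```

The learning rates are computed exactly once, so no per-step second-moment statistics are maintained, which is the computational appeal the abstract points to.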
Supplementary Material: zip
Primary Area: optimization
Submission Number: 5779