Keywords: Optimization, Learning rate, LLM
TL;DR: Static, initialization-based learning rates as a simple yet effective method to improve neural network training
Abstract: A major characteristic of the Adam optimizer is its adaptive step size modification, which prevents large gradients from dominating the update step size. Given that the simplicity and computational efficiency of first-order methods are a significant advantage for large-scale training, we investigate an extreme form of step size modification that assigns static, layer-wise learning rates inversely proportional to the initial gradient magnitudes. We observe that this simple heuristic is surprisingly effective at improving the rate of convergence on LLM-style models relative to eight contemporary optimizers, suggesting the possibility of a static, initialization-based preconditioner.
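The abstract's heuristic can be illustrated with a minimal sketch: measure each layer's gradient norm once at initialization, fix per-layer learning rates inversely proportional to those norms, then train with plain SGD. The exact scaling rule, model, and data below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear model on synthetic regression data (illustration only).
X = rng.normal(size=(64, 8))
y = rng.normal(size=(64, 1))
W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(16, 1))

def loss(W1, W2):
    return 0.5 * np.mean((X @ W1 @ W2 - y) ** 2)

def grads(W1, W2):
    h = X @ W1                        # hidden activations
    err = h @ W2 - y                  # residual
    g2 = h.T @ err / len(X)           # dL/dW2 for 0.5 * MSE
    g1 = X.T @ (err @ W2.T) / len(X)  # dL/dW1
    return g1, g2

# Static, layer-wise learning rates fixed once from the initial gradient norms:
# lr_l = base_lr / ||g_l||  (hypothetical form; the paper's scaling may differ).
base_lr = 0.05
g1, g2 = grads(W1, W2)
lr1 = base_lr / (np.linalg.norm(g1) + 1e-12)
lr2 = base_lr / (np.linalg.norm(g2) + 1e-12)

loss_init = loss(W1, W2)
for _ in range(100):                  # plain SGD with the fixed per-layer rates
    g1, g2 = grads(W1, W2)
    W1 -= lr1 * g1
    W2 -= lr2 * g2
```

The learning rates are computed exactly once, so no per-step second-moment statistics are maintained, which is the computational appeal the abstract points to.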
Supplementary Material: zip
Primary Area: optimization
Submission Number: 5779