Layerwise Learning Rate in the Era of Large Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Learning Rate, Large Language Models, Heavy-Tailed Self-Regularization
TL;DR: We propose a per-module learning rate method guided by heavy-tailedness, improving large language model performance.
Abstract: Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M–1B) demonstrate that LLR achieves up to a 1.5× training speedup compared to a uniform LR. Under the same training token budget, LLR further surpasses existing approaches by a clear margin. A key advantage of LLR is its low tuning overhead: it transfers near-optimal LR settings directly from the uniform baseline, substantially lowering the barrier to practical adoption. Our code is included in the submission.
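The abstract's recipe can be sketched concretely: estimate each layer's heavy-tail exponent from the ESD of its weight correlation matrix, then scale the learning rate up for layers with weaker heavy-tailedness (larger exponent) and down for layers with stronger heavy-tailedness (smaller exponent). The sketch below is illustrative only, not the paper's implementation: it assumes a power-law MLE over the top eigenvalues as the heavy-tailedness proxy and a simple mean-normalized scaling; the function names and the tail-fraction parameter are our own.

```python
import numpy as np

def esd_alpha(weight, tail_frac=0.1):
    """Estimate a heavy-tail exponent (alpha) for one layer.

    The ESD is the eigenvalue distribution of the correlation matrix
    W^T W / N. A power-law MLE (Hill-style) over the largest eigenvalues
    serves as a simple proxy for the HT-SR heavy-tailedness metric.
    """
    n = weight.shape[0]
    eigs = np.linalg.eigvalsh(weight.T @ weight / n)
    eigs = np.sort(eigs)[::-1]                     # descending
    k = max(2, int(len(eigs) * tail_frac))         # tail size (assumed knob)
    tail = eigs[:k]
    # Continuous power-law MLE with x_min = smallest tail eigenvalue:
    # alpha = 1 + k / sum(log(lambda_i / lambda_k))
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

def layerwise_lrs(weights, base_lr):
    """Assign per-layer LRs inversely related to heavy-tailedness.

    Smaller alpha => heavier tail => smaller LR; larger alpha => lighter
    tail => larger LR. Scales are mean-normalized so the average LR
    matches the uniform baseline base_lr.
    """
    alphas = np.array([esd_alpha(w) for w in weights])
    return base_lr * (alphas / alphas.mean())
```

In a training loop, the resulting per-layer rates would typically be wired into one optimizer parameter group per layer (e.g. AdamW's `param_groups`), refreshed periodically as the spectra evolve.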
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10178