How Does Local Landscape Geometry Evolve in Language Model Pre-Training?

ICLR 2026 Conference Submission 9645 Authors

17 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Language Model Pre-Training, Loss Landscape Geometry, Hyperparameter Tuning
TL;DR: We study how loss landscape geometry evolves during LLM pre-training, explaining learning-rate warmup and yielding a batch-size schedule that substantially improves data efficiency.
Abstract: The scale and expense of pre-training large language models make efficient hyperparameter tuning essential, yet principled guidance remains limited. To address this gap, we analyze language model pre-training dynamics from a local landscape geometry perspective. Our study reveals two distinct phases. In the *early* phase, the sharpness of the local landscape is initially high, leading to instability and loss plateaus under large learning rates (LRs). As training progresses, the landscape shifts from sharp to flatter regions. This dynamic explains the necessity of LR warmup and further suggests that larger peak LRs require proportionally longer warmup periods. In the *late* phase, the local landscape is governed by the gradient noise scale: high noise from smaller batches widens the loss basin, whereas reduced noise from larger batches deepens it. This insight inspires a dynamic batch-size (BS) schedule that increases the BS when the loss plateaus, achieving lower terminal loss with significantly fewer tokens than constant-BS training. Together with our theory, we provide a unified account of loss landscape evolution, which translates into actionable tuning strategies for large-scale pre-training.
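To make the two tuning rules from the abstract concrete, here is a minimal sketch, not the authors' implementation: a warmup length that scales linearly with the peak LR, and a plateau-triggered batch-size schedule. All names and constants (`warmup_steps`, `PlateauBatchSizeSchedule`, `ref_lr`, `window`, `rel_tol`, `growth_factor`) are illustrative assumptions, not values from the paper.

```python
from collections import deque


def warmup_steps(peak_lr, ref_lr=3e-4, ref_steps=2000):
    """Illustrative rule: warmup length grows proportionally with peak LR.

    The reference point (ref_lr, ref_steps) is an assumed constant, not taken
    from the paper.
    """
    return int(ref_steps * peak_lr / ref_lr)


class PlateauBatchSizeSchedule:
    """Increase the batch size whenever the smoothed training loss plateaus."""

    def __init__(self, init_bs, max_bs, window=200, rel_tol=1e-3, growth_factor=2):
        self.bs = init_bs
        self.max_bs = max_bs
        self.window = window              # steps per half of the plateau check
        self.rel_tol = rel_tol            # minimum relative improvement required
        self.growth_factor = growth_factor
        self._losses = deque(maxlen=2 * window)

    def update(self, loss):
        """Record the latest loss and return the batch size to use next step."""
        self._losses.append(loss)
        if len(self._losses) == self._losses.maxlen:
            history = list(self._losses)
            prev = sum(history[: self.window]) / self.window
            curr = sum(history[self.window:]) / self.window
            # Plateau: the recent window improved by less than rel_tol relative.
            if (prev - curr) / max(abs(prev), 1e-12) < self.rel_tol:
                self.bs = min(self.bs * self.growth_factor, self.max_bs)
                self._losses.clear()      # restart the plateau detector
        return self.bs
```

In a training loop, one would call `bs = schedule.update(loss.item())` each step and use `bs` when sampling the next batch; how the optimizer state and LR are adjusted when the BS changes is left unspecified here.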
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9645