Keywords: Language Model Pre-Training, Loss Landscape Geometry, Hyperparameter Tuning
TL;DR: We study how loss landscape geometry evolves during LLM pre-training, explaining learning-rate warmup and yielding a batch-size schedule that substantially improves data efficiency.
Abstract: The scale and expense of pre-training language models make efficient hyperparameter tuning essential, yet principled guidance is still missing. Recent work shows that the geometry of the loss landscape shapes the training dynamics of neural networks and can inform hyperparameter choices. In this work, we analyze language model pre-training dynamics from a local landscape geometry perspective.
Our study reveals two distinct phases. In the *early* phase, the sharpness of the local landscape is initially high, leading to instability and loss plateaus under large learning rates (LRs). Later, the landscape shifts from sharp to flatter regions. This dynamic explains the necessity of LR warmup and further suggests that larger peak LRs require proportionally longer warmup periods. In the *late* phase, the local landscape is governed by the gradient noise scale. Through a diffusion-limit analysis, we prove a depth–flatness trade-off: high noise from smaller batches widens the loss basin, whereas reduced noise from larger batches deepens it. This theory motivates a dynamic batch-size (BS) scheduler that begins with a small BS and increases it late in training. Together, these findings provide a unified account of loss landscape evolution and translate it into actionable tuning strategies for large-scale pre-training.
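To make the two tuning takeaways concrete, below is a minimal Python sketch of the recipe the abstract describes: a warmup whose length scales with the peak LR, and a batch-size schedule that starts small and switches to a large batch late in training. The function names, constants (`base_lr`, `base_warmup`, `switch_frac`, the batch sizes), and the linear-warmup/cosine-decay shape are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code) of the two tuning ideas:
# (1) warmup length proportional to the peak LR, and
# (2) a small-to-large batch-size switch late in training.
import math


def warmup_steps(peak_lr: float, base_lr: float = 1e-4, base_warmup: int = 1000) -> int:
    """Hypothetical rule: scale warmup length proportionally with the peak LR."""
    return int(base_warmup * peak_lr / base_lr)


def lr_schedule(step: int, peak_lr: float, total_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay (a common baseline shape)."""
    w = warmup_steps(peak_lr)
    if step < w:
        return peak_lr * (step + 1) / w
    progress = (step - w) / max(1, total_steps - w)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def batch_size_schedule(step: int, total_steps: int,
                        small_bs: int = 256, large_bs: int = 1024,
                        switch_frac: float = 0.8) -> int:
    """Small batch early (high gradient noise, wider basin);
    large batch late in training (lower noise, deeper basin)."""
    return small_bs if step < switch_frac * total_steps else large_bs


if __name__ == "__main__":
    total = 10_000
    for step in (0, 500, 5_000, 8_500, 9_999):
        print(step,
              f"lr={lr_schedule(step, peak_lr=3e-4, total_steps=total):.2e}",
              f"bs={batch_size_schedule(step, total)}")
```

With these assumed constants, a 3x larger peak LR gets a 3x longer warmup, and the batch size jumps from 256 to 1024 at 80% of training; the actual crossover point and sizes would need to be set from the paper's analysis or tuned empirically.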
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9645