When Bigger is Better: Revisiting Large-Batch Optimization in Language Model Pretraining

11 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY-SA 4.0
Keywords: LLMs; batch size; large batch training; hyperparameter scaling laws; optimization; gradient noise
Abstract: Large-batch training promises near-linear speedups in language model pretraining, yet existing studies highlight its poor optimization dynamics and degraded final performance. In this paper, we seek to understand this failure and show that large-batch training can in fact substantially outperform conventional small-batch training. We first identify a critical oversight in the conventional view: large-batch training can substantially surpass small-batch baselines when provided sufficient tokens, but this advantage often goes unrecognized because of its poor initial optimization dynamics, manifested as larger gradient norms and even worse per-step loss during the early warm-up phase. To address this, we introduce a simple batch size scheduler that stabilizes and improves training at remarkably large batch sizes. Our scheduler scales pretraining up to batches of 32M tokens, using $3.3\times$ less compute to reach the superior later-stage performance of large-batch training. Detailed analyses of gradient dynamics reveal that batch size fundamentally changes the optimization geometry. Notably, we show that classic gradient noise scale metrics fail to predict the optimal batch size. Our findings offer practical recipes for designing efficient and effective pretraining pipelines and deepen the theoretical understanding of large-batch optimization dynamics in language model pretraining.
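The abstract does not specify the form of the batch size scheduler, only that it ramps training up to 32M-token batches after a problematic early phase. Below is a minimal, hypothetical sketch of one plausible variant: a linear ramp of the token batch size from a small starting value to the large-batch regime. The function name, the ramp shape, and the `ramp_fraction` parameter are illustrative assumptions, not the authors' method.

```python
def batch_size_schedule(step: int, total_steps: int,
                        min_batch_tokens: int = 1 << 20,        # assumed 1M-token starting batch
                        max_batch_tokens: int = 32 * (1 << 20),  # 32M-token batch, as in the abstract
                        ramp_fraction: float = 0.5) -> int:
    """Return a target batch size (in tokens) for the given training step.

    Hypothetical linear ramp: start with a small batch to sidestep the poor
    early-phase dynamics of large batches, then grow toward the large-batch
    regime where the paper reports superior later-stage performance.
    """
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step >= ramp_steps:
        return max_batch_tokens
    frac = step / ramp_steps
    return int(min_batch_tokens + frac * (max_batch_tokens - min_batch_tokens))


# Example usage (illustrative only):
#   batch_size_schedule(0, 100_000)      -> ~1M tokens
#   batch_size_schedule(60_000, 100_000) -> 32M tokens
```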
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 22931