Keywords: Large-scale pretraining, Large Language Models, batch size
Abstract: The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms no longer align with the new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised $E(S)$ relationship tailored to the WSD scheduler, characterizing the trade-off between training data consumption $E$ and training steps $S$ during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) $B_{\min}$, the minimum batch size required to reach a target loss, and 2) $B_{\text{opt}}$, the optimal batch size that maximizes data efficiency by minimizing total training tokens. Building on these properties, we propose a dynamic Batch Size Scheduler. Extensive experiments demonstrate that our revised formula precisely captures the dynamics of large-scale pre-training, and that the resulting scheduling strategy significantly improves both training efficiency and final model quality.
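For context, the original $E(S)$ trade-off from the critical-batch-size analysis (McCandlish et al., 2018), which the abstract identifies as the framework being revised, can be written as below; the WSD-specific revision derived in the paper is not reproduced here.

```latex
% Original critical-batch-size trade-off (McCandlish et al., 2018),
% i.e., the framework this paper revises for the WSD scheduler.
% S: optimization steps and E: data consumed to reach a fixed target loss.
\[
  \left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1,
  \qquad
  B_{\mathrm{crit}} = \frac{E_{\min}}{S_{\min}},
\]
% Training near B = B_crit balances time efficiency (fewer steps)
% against data efficiency (fewer examples); the paper's B_min and B_opt
% play analogous roles under the revised WSD formulation.
```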
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 6976