Towards Revealing the Effect of Batch Size Scheduling on Pre-training

Towards Revealing the Effect of Batch Size Scheduling on Pre-training

ICLR 2026 Conference Submission18875 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Batch Size Scheduling; Training Dynamics

Abstract: Training large-scale foundation models relies on effective parallelism strategies, especially batch size scheduling. However, despite its widespread practical use, the influence of batch size scheduling on training dynamics remains poorly understood. In this work, we first investigate this through a simple two-stage batch size schedule. Specifically, we train the language models with a constant learning rate using three batch size schedules: i) small constant batch size, ii) large constant batch size, and iii) a schedule that switches from small (i) to large (ii) at some switching point. We observe two notable behaviors: (1) **sudden drop**, a sharp drop in loss occurs at the switching point, compared to the loss trajectory of the small batch size; (2) **final merge**, a gradual convergence in loss to the trajectory of the large batch size. To understand the underlying mechanism behind these phenomena, we then provide a theoretical analysis from the perspective of power-law kernel regression setup. We leverage the **Functional Scaling Law (FSL)** introduced in the recent work by Li et al. (2025), which provides a theoretical framework for analyzing LLM pre-training dynamics. Our analysis shows that increasing batch size provably leads to a sudden loss drop by reducing SGD noise and guarantees convergence to the large batch trajectory at the same step level. Under the data-limited regime, our analysis further reveals a trade-off between intrinsic optimization time and SGD noise in the choice of switching point, predicting that the optimal switching point scales as a power law with total data size. Finally, we empirically validate these theoretical findings through language model pre-training experiments up to 1.1B parameters and 1T tokens, confirming the consistency of our theoretical insights.

Supplementary Material: zip

Primary Area: learning theory

Submission Number: 18875

Loading