Critical Batch Size for LLM Policy Optimization

Published: 29 May 2026, Last Modified: 29 May 2026, Venue: HiLD at ICML 2026 Poster, License: CC BY 4.0
Keywords: critical batch size, GRPO
TL;DR: We analyze the critical batch size for LLM policy optimization, both theoretically and empirically.
Abstract: In supervised learning, the critical batch size marks the point beyond which increasing the batch size costs sample efficiency, and it therefore bounds how far data parallelism can be used to improve training efficiency. We study the critical batch size for verifier-based reinforcement learning (RLVR) under a GRPO-style objective, where the gradient noise depends on the number of prompts $B$, the number of rollouts per prompt $K$, and off-policy rollout reuse. We extend the noise-scale model of McCandlish et al. (2018) to GRPO by decomposing the on-policy noise into inter-prompt and intra-prompt terms and by modeling off-policy reuse as drift-inflated intra-prompt noise. We empirically measure the critical batch size in both the on-policy and off-policy settings. In our experiments, the gradient noise is dominated by the intra-prompt term, so the relevant batch dimension is approximately the total rollout count $N=BK$. We also find that off-policy rollout reuse substantially increases the critical batch size relative to the on-policy setting, suggesting a practical parallelism advantage for RLVR post-training.
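The inter-prompt and intra-prompt decomposition described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: it assumes a generic two-stage sampling picture in which prompts are drawn independently and rollouts are drawn independently within each prompt, so the variance of the averaged gradient estimate is $\sigma_{inter}^2/B + \sigma_{intra}^2/(BK)$. The function name, the parameters `sigma_inter2` and `sigma_intra2`, and the multiplicative `drift` factor standing in for off-policy inflation are all hypothetical choices made for illustration.

```python
def grad_noise_variance(num_prompts, rollouts_per_prompt,
                        sigma_inter2, sigma_intra2, drift=1.0):
    """Variance of an averaged gradient estimate under two-stage sampling.

    Assumptions (illustrative, not taken from the paper):
      - prompts are sampled independently (B = num_prompts),
      - rollouts are sampled independently within a prompt (K = rollouts_per_prompt),
      - off-policy reuse multiplies the intra-prompt variance by drift >= 1.
    """
    if num_prompts < 1 or rollouts_per_prompt < 1:
        raise ValueError("B and K must be positive integers")
    if drift < 1.0:
        raise ValueError("drift is modeled as an inflation factor, so it must be >= 1")

    # Inter-prompt term: shrinks only with the number of prompts B.
    inter = sigma_inter2 / num_prompts
    # Intra-prompt term: shrinks with the total rollout count N = B * K.
    intra = drift * sigma_intra2 / (num_prompts * rollouts_per_prompt)
    return inter + intra


if __name__ == "__main__":
    # When the intra-prompt term dominates (sigma_inter2 = 0 here), any split
    # of the same total rollout count N = B * K gives the same variance.
    for b, k in [(4, 8), (8, 4), (32, 1)]:
        print(b, k, grad_noise_variance(b, k, sigma_inter2=0.0, sigma_intra2=2.0))

    # A drift factor above 1 raises the noise at a fixed (B, K).
    print(grad_noise_variance(8, 4, 0.0, 2.0, drift=3.0))
```

Under these assumptions, setting `sigma_inter2` to zero makes the variance depend on $B$ and $K$ only through $N=BK$, which matches the abstract's observation about the relevant batch dimension. In the noise-scale picture, a larger gradient noise corresponds to a larger critical batch size, so a `drift` above 1 points in the same direction as the reported off-policy result.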
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 139