Critical Batch Size for LLM Policy Optimization

Rachit Bansal; Clara Mohri; Natalie Abreu; David Alvarez-Melis; Sham M. Kakade

Published: 29 May 2026, Last Modified: 29 May 2026. Venue: HiLD at ICML 2026 (Poster). License: CC BY 4.0

Keywords: critical batch size, GRPO

TL;DR: We analyze the critical batch size for LLM policy optimization, both theoretically and empirically.

Abstract: In supervised learning, the critical batch size marks the point beyond which increasing the batch size reduces sample efficiency, and it therefore bounds how far data parallelism can be used to speed up training. We study the critical batch size for verifier-based reinforcement learning (RLVR) under a GRPO-style objective, where gradient noise depends on the number of prompts $B$, the number of rollouts per prompt $K$, and off-policy rollout reuse. We extend the noise-scale model of McCandlish et al. (2018) to GRPO by decomposing on-policy noise into inter-prompt and intra-prompt terms and by modeling off-policy reuse as drift-inflated intra-prompt noise. We empirically measure the critical batch size in both on-policy and off-policy settings. In our experiments, gradient noise is dominated by the intra-prompt term, and the relevant batch dimension is approximately the total rollout count $N=BK$. Off-policy rollout reuse substantially increases the critical batch size relative to the on-policy setting, suggesting a practical parallelism advantage for RLVR post-training.
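To make the role of $N=BK$ concrete, here is a minimal sketch, not the paper's derivation. It assumes an idealized two-level sampling model in which prompts are drawn i.i.d. and rollouts are drawn i.i.d. given the prompt, so the law of total variance gives a gradient-estimate variance of $\sigma^2_{\text{inter}}/B + \sigma^2_{\text{intra}}/(BK)$. Group-normalized GRPO advantages couple rollouts within a prompt, so the real setting is only approximated. The names `sigma2_inter`, `sigma2_intra`, `grad_sq_norm`, and the multiplier `drift_inflation` (a stand-in for off-policy drift) are hypothetical and do not come from the paper.

```python
def grad_variance(sigma2_inter, sigma2_intra, B, K, drift_inflation=1.0):
    """Variance of a gradient estimate averaged over B prompts x K rollouts.

    Assumes idealized two-level i.i.d. sampling (an assumption of this sketch).
    drift_inflation >= 1 is a hypothetical factor that scales the intra-prompt
    term to mimic off-policy rollout reuse.
    """
    return sigma2_inter / B + drift_inflation * sigma2_intra / (B * K)


def rollout_noise_scale(sigma2_inter, sigma2_intra, grad_sq_norm, K,
                        drift_inflation=1.0):
    """Total rollout count N = B*K at which noise variance equals |G|^2.

    This mirrors the simple noise scale tr(Sigma) / |G|^2 of McCandlish et al.
    (2018), expressed in rollouts rather than examples.
    """
    per_rollout_noise = K * sigma2_inter + drift_inflation * sigma2_intra
    return per_rollout_noise / grad_sq_norm


# With no inter-prompt noise, only N = B*K matters, not the split between B and K.
same_n = grad_variance(0.0, 8.0, B=2, K=4) == grad_variance(0.0, 8.0, B=4, K=2)

# Inflating the intra-prompt term raises the noise scale proportionally.
on_policy = rollout_noise_scale(0.0, 8.0, grad_sq_norm=2.0, K=4)
off_policy = rollout_noise_scale(0.0, 8.0, grad_sq_norm=2.0, K=4,
                                 drift_inflation=2.0)
```

Under these assumptions, when the intra-prompt term dominates, the noise scale in rollouts is nearly independent of $K$, which is one way to read the abstract's finding that $N=BK$ is the relevant batch dimension.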

Submission Number: 139
