Batch size-invariance for policy optimization

Jacob Hilton; Karl Cobbe; John Schulman

Published: 31 Oct 2022, Last Modified: 06 Apr 2025, Venue: NeurIPS 2022 Accept, Readers: Everyone

Keywords: reinforcement learning, policy gradient, learning rate

TL;DR: We show how to make PPO batch size-invariant (changes to the batch size can largely be compensated for by changing other hyperparameters) by decoupling the proximal policy (used for controlling the size of policy updates) from the behavior policy.

Abstract: We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
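
To make the decoupling idea concrete, here is a minimal sketch of how a PPO-style clipped objective might be written with separate proximal and behavior policies. It is an illustration under stated assumptions, not the authors' implementation: the function name `decoupled_clipped_objective`, its argument names, and the default `clip_eps=0.2` are hypothetical, and the exact form (clip the ratio of the current policy to the proximal policy, then reweight by the ratio of the proximal policy to the behavior policy) is one plausible reading of the abstract.

```python
import math


def decoupled_clipped_objective(logp_new, logp_prox, logp_behav,
                                advantages, clip_eps=0.2):
    """Mean clipped surrogate with separate proximal and behavior policies.

    All arguments are equal-length sequences of per-sample values.
    Names and the default clip_eps are assumptions for illustration.
    """
    total = 0.0
    for lp_new, lp_prox, lp_behav, adv in zip(
            logp_new, logp_prox, logp_behav, advantages):
        # Off-policy correction: weight of the proximal policy
        # relative to the policy that actually collected the data.
        behav_weight = math.exp(lp_prox - lp_behav)
        # Update-size control: ratio of the current policy to the
        # proximal policy, clipped to [1 - clip_eps, 1 + clip_eps].
        ratio = math.exp(lp_new - lp_prox)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (minimum) surrogate, as in standard PPO clipping.
        total += behav_weight * min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When `logp_prox` equals `logp_behav`, the off-policy weight is 1 and the sketch reduces to the familiar clipped PPO surrogate, which is the coupled case the abstract contrasts against.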

Supplementary Material: zip

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/batch-size-invariance-for-policy-optimization/code)
