Training PPO-Clip with Parallelized Data Generation: A Case of Fixed-Point Convergence

Published: 22 Jun 2025, Last Modified: 27 Jul 2025
Venue: IBRL @ RLC 2025
License: CC BY 4.0
Keywords: Reinforcement Learning, Proximal Policy Optimization, Parallel Data Collection
Abstract: In recent years, as GPU compute power has grown, parallelized data collection has become the dominant approach for training reinforcement learning (RL) agents. Proximal Policy Optimization (PPO) is one of the most widely used on-policy methods for training RL agents. In this paper, we study the training behavior of PPO-Clip as the number of parallel environments increases. In particular, we show that as the amount of data used to train PPO-Clip grows, the optimized policy converges to a fixed distribution. We use this result to study the behavior of PPO-Clip in two case studies: the effect of changing the minibatch size, and the effect of increasing the number of parallel environments versus increasing the rollout length. The experiments show that settings yielding high-return PPO runs converge more slowly to the fixed distribution and exhibit larger KL divergence between consecutive policies. Our results aim to offer a better understanding of, and a basis for predicting, the performance of PPO as the number of parallel environments scales.
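For readers less familiar with the quantities the abstract refers to, the following is a minimal, illustrative Python sketch (not code from the paper) of the two ingredients at play: the PPO-Clip surrogate objective and the KL divergence between consecutive policies, computed over a batch of samples such as those gathered from parallel environments. All function names, batch sizes, and random data below are hypothetical.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO-Clip for a batch of samples.

    ratio:     pi_new(a|s) / pi_old(a|s), shape (N,)
    advantage: estimated advantages, shape (N,)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

def categorical_kl(p, q, axis=-1):
    """Mean KL(p || q) between two batches of categorical distributions."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=axis).mean()

# Toy example: 4096 samples, as if collected from parallel environments,
# with a discrete action space of size 6 (all numbers are arbitrary).
rng = np.random.default_rng(0)
n_samples, n_actions = 4096, 6

logits_old = rng.normal(size=(n_samples, n_actions))
logits_new = logits_old + 0.05 * rng.normal(size=(n_samples, n_actions))

probs_old = np.exp(logits_old) / np.exp(logits_old).sum(-1, keepdims=True)
probs_new = np.exp(logits_new) / np.exp(logits_new).sum(-1, keepdims=True)

actions = np.array([rng.choice(n_actions, p=p) for p in probs_old])
advantages = rng.normal(size=n_samples)

idx = np.arange(n_samples)
ratio = probs_new[idx, actions] / probs_old[idx, actions]

print("clipped surrogate objective:", ppo_clip_objective(ratio, advantages))
print("consecutive-policy KL      :", categorical_kl(probs_old, probs_new))
```

In this reading, tracking the consecutive-policy KL across updates is one way to gauge how quickly the optimized policy settles toward a fixed distribution as the amount of collected data grows.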
Submission Number: 12