Keywords: Reinforcement learning, Deep RL, parallelization, data scaling
TL;DR: We systematically trace the evolution of value-based algorithms from DQN to R2D2 and PQN, shedding new light on the trade-off between data collection and training stability.
Abstract: A key ingredient for successfully applying deep reinforcement learning to challenging tasks is the effective use of data at scale. Although deep RL algorithms originally achieved this by storing past experiences collected by a synchronous actor in an external replay memory [DQN; Mnih et al., 2013], follow-up work scaled training by collecting data asynchronously through distributed actors [R2D2; Kapturowski et al., 2018], and more recently through GPU-optimized parallelization [PQN; Gallici et al., 2024]. We argue that DQN, R2D2, and PQN constitute a family of value-based methods for parallel training, and we study them to shed light on the dynamics induced by varying data collection schemes. We conduct a thorough empirical study to better understand these dynamics and propose the Data Replay Ratio as a novel metric for quantifying data reuse. Our findings suggest that maximizing data reuse requires directly addressing the deadly triad: Q-lambda rollouts to reduce the bias from bootstrapping, LayerNorm to stabilize function approximation, and parallelized data collection to mitigate off-policy divergence.
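To make the Q-lambda component concrete, below is a minimal illustrative sketch (not the authors' implementation) of how Peng's Q(λ) returns are typically computed backwards over a rollout to blend one-step bootstrapping with longer multi-step returns; the function name, array shapes, and default hyperparameters are assumptions for illustration only.

```python
# Illustrative sketch only: backward Q(lambda) return computation over a rollout.
# Assumptions: rewards and dones have shape [T]; q_next has shape [T] and holds
# max_a Q(s_{t+1}, a) for each step t; gamma and lam are scalar hyperparameters.
import numpy as np

def q_lambda_returns(rewards, dones, q_next, gamma=0.99, lam=0.65):
    T = len(rewards)
    returns = np.zeros(T)
    # Bootstrap from the greedy value at the final step of the rollout.
    next_return = q_next[-1]
    for t in reversed(range(T)):
        # Mix the one-step bootstrap with the longer multi-step return.
        blended = (1.0 - lam) * q_next[t] + lam * next_return
        returns[t] = rewards[t] + gamma * (1.0 - dones[t]) * blended
        next_return = returns[t]
    return returns
```

With lam=0 this reduces to the standard one-step Q-learning target, and with lam=1 it approaches a Monte Carlo return truncated at the rollout boundary, which is why intermediate values trade off bootstrapping bias against return variance.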
Submission Number: 14