SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores

Published: 20 Jun 2023, Last Modified: 16 Jul 2023, ES-FoMO 2023 Oral
Keywords: Reinforcement Learning, Distributed Systems, Large Scale Training
Abstract: The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system that can train intelligent agents by efficiently producing and processing massive amounts of data. In this paper, we propose a comprehensive computational abstraction for RL training tasks and introduce a scalable, efficient, and extensible RL system called Really Scalable RL (SRL), featuring a novel architecture that separates three major computation components in RL training. Our evaluation demonstrates that SRL outperforms the popular open-source RL system RLlib (Liang et al., 2017) in training throughput. Moreover, to assess the learning performance of SRL, we have conducted a benchmark on a large-scale cluster with 32 Nvidia A100 GPUs, 64 Nvidia RTX 3090 GPUs, and more than 10000 CPU cores, reproducing the results of OpenAI's industrial production system, Rapid (Berner et al., 2019), in the hide-and-seek environment (Baker et al., 2019). The results show that SRL achieves up to a 5 times training speedup compared to the published results in Baker et al. (2019).
Submission Number: 7