Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Keywords: GPU-based simulation, off-policy learning, distributed training, reinforcement learning
Abstract: Reinforcement learning algorithms typically require large amounts of training data, resulting in long training times, especially on challenging tasks. With recent advances in GPU-based simulation, such as Isaac Gym, data collection can be thousands of times faster on a commodity GPU. Most prior work has used on-policy methods such as PPO to train policies in Isaac Gym because of their simplicity and effectiveness at scale. Off-policy methods are usually more sample-efficient but harder to scale up, leading to much longer wall-clock training times in practice. In this work, we present a novel parallel $Q$-learning framework that not only achieves better sample efficiency but also reduces wall-clock training time compared to PPO. Unlike prior work on distributed off-policy learning, such as Ape-X, our framework is designed specifically for massively parallel GPU-based simulation and is optimized to run on a single workstation. We demonstrate that $Q$-learning methods can be scaled up to tens of thousands of parallel environments, and we investigate factors that affect training speed, including the number of parallel environments, exploration schemes, batch size, and GPU model.
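The abstract summarizes the approach at a high level. As a rough illustration only, the PyTorch sketch below shows the general pattern it describes: a batched vectorized simulator, a GPU-resident replay buffer that absorbs all parallel transitions at once, and off-policy (here, DDPG-style) $Q$-learning updates, all kept on one device. This is not the authors' implementation; the toy environment (ToyVecEnv), network sizes, exploration noise, and hyperparameters are assumptions chosen for demonstration.

import copy
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
num_envs, obs_dim, act_dim, capacity = 4096, 32, 8, 200_000   # assumed sizes, not from the paper

class ToyVecEnv:
    """Stand-in for a batched GPU simulator such as Isaac Gym (random dynamics)."""
    def reset(self):
        return torch.randn(num_envs, obs_dim, device=device)
    def step(self, act):
        obs = torch.randn(num_envs, obs_dim, device=device)
        rew = -act.pow(2).sum(-1)                              # placeholder reward
        done = torch.rand(num_envs, device=device) < 0.01
        return obs, rew, done

# Replay buffer kept entirely on the GPU; all num_envs transitions are written each step.
obs_buf  = torch.zeros(capacity, obs_dim, device=device)
act_buf  = torch.zeros(capacity, act_dim, device=device)
rew_buf  = torch.zeros(capacity, device=device)
nobs_buf = torch.zeros(capacity, obs_dim, device=device)
done_buf = torch.zeros(capacity, device=device)
ptr, size = 0, 0

q  = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1)).to(device)
pi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim), nn.Tanh()).to(device)
q_tgt  = copy.deepcopy(q)
q_opt  = torch.optim.Adam(q.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(pi.parameters(), lr=1e-3)

env = ToyVecEnv()
obs = env.reset()
gamma = 0.99
for step in range(100):
    # Batched action selection; per-env noise scales could implement different exploration schemes.
    with torch.no_grad():
        act = (pi(obs) + 0.1 * torch.randn(num_envs, act_dim, device=device)).clamp(-1, 1)
    next_obs, rew, done = env.step(act)

    # Circular write of all parallel transitions in one indexing operation.
    idx = (ptr + torch.arange(num_envs, device=device)) % capacity
    obs_buf[idx], act_buf[idx], rew_buf[idx] = obs, act, rew
    nobs_buf[idx], done_buf[idx] = next_obs, done.float()
    ptr = (ptr + num_envs) % capacity
    size = min(size + num_envs, capacity)
    obs = next_obs

    # DDPG-style critic update on a large GPU mini-batch sampled from the buffer.
    b = torch.randint(0, size, (8192,), device=device)
    with torch.no_grad():
        next_q = q_tgt(torch.cat([nobs_buf[b], pi(nobs_buf[b])], dim=-1)).squeeze(-1)
        target = rew_buf[b] + gamma * (1.0 - done_buf[b]) * next_q
    q_loss = (q(torch.cat([obs_buf[b], act_buf[b]], dim=-1)).squeeze(-1) - target).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Deterministic-policy (actor) update through the learned Q-function.
    pi_loss = -q(torch.cat([obs_buf[b], pi(obs_buf[b])], dim=-1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Polyak averaging of the target critic.
    with torch.no_grad():
        for p, p_t in zip(q.parameters(), q_tgt.parameters()):
            p_t.mul_(0.995).add_(p, alpha=0.005)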
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (e.g., decision and control, planning, hierarchical RL, robotics)
TL;DR: We present a parallel training framework that scales up $Q$-learning algorithms on a single workstation and achieves faster learning speed than PPO.