Keywords: reinforcement learning, rl, offline rl, continuous control, atari, sample efficiency
Abstract: Sample efficiency and performance in the offline setting have emerged as two of the main
challenges of deep reinforcement learning. We introduce Q-Value Weighted Regression (QWR),
a simple RL algorithm that excels in both respects.
QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm
that performs very well on continuous control tasks, including in the offline setting, but struggles
on tasks with discrete actions and suffers from limited sample efficiency. We perform a theoretical analysis
of AWR that explains its shortcomings and use the insights to motivate QWR theoretically.
We show experimentally that QWR matches state-of-the-art algorithms on tasks with both
continuous and discrete actions. We study the main hyperparameters of QWR
and find that it is stable across a wide range of hyperparameter choices and across different tasks.
In particular, QWR yields results on par with SAC on the MuJoCo suite and, with
the same set of hyperparameters, results on par with a highly tuned Rainbow
implementation on a set of Atari games. We also verify that QWR performs well in the
offline RL setting, making it a compelling choice for reinforcement learning in domains
with limited data.
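For context, the sketch below illustrates the weighted-regression actor update that AWR-style methods build on. It is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes, purely from the method names, that QWR replaces AWR's exponentiated-advantage weight with a weight derived from a learned Q-value estimate, and the network sizes, temperature `beta`, and weight clip are illustrative placeholders.

```python
# Minimal, self-contained sketch (PyTorch) of a weighted-regression actor update
# in the style of AWR. Assumption not stated in the abstract: QWR swaps the
# advantage-based weight exp(A(s, a) / beta) for a weight based on a learned
# Q-value estimate. All architectures and hyperparameters here are illustrative.
import torch
import torch.nn as nn

obs_dim, act_dim, beta, weight_clip = 8, 2, 1.0, 20.0

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
log_std = nn.Parameter(torch.zeros(act_dim))
optimizer = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=3e-4)

def actor_loss(states, actions):
    """Regress the policy onto replayed actions, weighted by an exponentiated
    score (an advantage estimate here; a Q-value estimate in the QWR reading)."""
    with torch.no_grad():
        score = q_net(torch.cat([states, actions], dim=-1)) - v_net(states)
        weights = torch.clamp(torch.exp(score / beta), max=weight_clip)
    dist = torch.distributions.Normal(policy(states), log_std.exp())
    log_prob = dist.log_prob(actions).sum(-1, keepdim=True)
    return -(weights * log_prob).mean()

# One illustrative gradient step on a random "replay" batch.
states, actions = torch.randn(32, obs_dim), torch.randn(32, act_dim)
loss = actor_loss(states, actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Clamping the exponentiated weights is a common stabilization choice in weighted-regression methods; the actual update rule and critic training used by QWR are described in the reviewed version linked below.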
One-sentence Summary: We analyze the sample efficiency of actor-critic RL algorithms and introduce a new algorithm that achieves superior sample efficiency while maintaining competitive final performance on the MuJoCo task suite and on Atari games.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2102.06782/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=oJjJOoFGhu