Keywords: reinforcement learning, stability, instability, policy collapse, return degradation, deadly triad, divergence, convergence, off-policy RL, offline RL, deep RL, value estimation, bootstrapping
Abstract: In reinforcement learning, deep $Q$-learning algorithms are often more sample- and compute-efficient than alternatives like the Monte Carlo policy gradient, but tend to suffer from instability that limits their use in practice. Some of this instability can be mitigated through a delayed *target network*, yet this usually slows down convergence. In this work, we explore the possibility of stabilization without sacrificing the speed of convergence. Inspired by self-supervised learning (SSL) and adaptive optimization, we empirically arrive at three modifications to the standard deep $Q$-network (DQN) — no two of which work well alone in our experiments. These modifications are, in the order of our experiments: 1) an **A**symmetric *predictor* in the neural network, 2) a particular combination of **N**ormalization layers, and 3) **H**ypergradient descent on the learning rate. Aligning with prior work in SSL, **HANQ** (pronounced "*hank*") avoids DQN's target network, uses the same number of hyperparameters as DQN, and yet matches or exceeds DQN's performance in our experiments on three out of four environments.
Submission Number: 137
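To make the three modifications named in the abstract more concrete, below is a minimal, hypothetical sketch: a Q-network with an asymmetric predictor head and normalization layers, plus a hypergradient update on a scalar learning rate in the style of Baydin et al. (2018). This is not the authors' HANQ implementation; the abstract does not specify the architecture, the exact placement of normalization, or the hypergradient rule, so all names, layer choices, and defaults here are illustrative assumptions.

```python
# Hypothetical sketch only -- not the paper's code. Assumes PyTorch.
import torch
import torch.nn as nn


class QNetWithPredictor(nn.Module):
    """Illustrative Q-network with LayerNorm and an asymmetric 'predictor'
    head (applied on the online branch only), loosely in the spirit of
    SSL-style predictors. Layer sizes and placement are assumptions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        )
        self.predictor = nn.Sequential(
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.q_head(self.predictor(self.encoder(obs)))


def hypergradient_sgd_step(params, grads, prev_grads, lr, hyper_lr=1e-8):
    """One plain-SGD step with a hypergradient update on the learning rate:
    lr <- lr + hyper_lr * <grad_t, grad_{t-1}>, i.e. gradient descent on the
    loss with respect to lr (Baydin et al., 2018). Returns the new lr and a
    detached copy of the current gradients for the next call."""
    if prev_grads is not None:
        h = sum((g * pg).sum() for g, pg in zip(grads, prev_grads))
        lr = lr + hyper_lr * h.item()
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g
    return lr, [g.detach().clone() for g in grads]


# Usage sketch (hypothetical names; td_loss is assumed, e.g. a Huber loss
# on the one-step TD error without a target network):
#   qnet = QNetWithPredictor(obs_dim=8, n_actions=4)
#   lr, prev_grads = 1e-3, None
#   loss = td_loss(qnet, batch)
#   grads = torch.autograd.grad(loss, list(qnet.parameters()))
#   lr, prev_grads = hypergradient_sgd_step(
#       list(qnet.parameters()), grads, prev_grads, lr)
```

The sign of the hypergradient update follows from differentiating the SGD step through the learning rate: since $\theta_t = \theta_{t-1} - \alpha \nabla L(\theta_{t-1})$, descending $\partial L(\theta_t)/\partial \alpha$ adds a term proportional to the inner product of consecutive gradients.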