Target Rate Optimization: Avoiding Iterative Error Exploitation

Published: 07 Nov 2023, Last Modified: 20 Nov 2023 · FMDM@NeurIPS2023
Keywords: reinforcement learning, off-policy RL, offline RL, divergence, convergence, stability, convergent, deadly triad
Abstract: Many real-world reinforcement learning (RL) problems remain intractable. A key issue is that sample-efficient RL algorithms are unstable. Early stopping sometimes works around this, yet early stopping in RL can be difficult, since the instability itself can leave few training steps with good policies. Standard early stopping also halts all learning. A more robust alternative is to fix the early stopping implicitly performed by most target networks: in algorithms like DQN, the target update rate already early-stops DQN’s target-fitting subproblems. Currently, practitioners must either hope the default target rate performs well, or tune it with an expensive grid search over online returns. Moreover, within a run, algorithms like DQN continue to update the target even when the updates _increase_ the training error; this degrades value estimates, which in turn degrades returns. Newer off-policy and offline RL algorithms lessen this well-known deadly-triad divergence, but often require excessive pessimism to avoid it, gaining stability at the cost of return. To combat these issues, we propose optimizing the training error with respect to the target update rate. Our algorithm, Target Rate Optimization, empirically prevents divergence and increases return by up to ~3× on a handful of discrete- and continuous-action RL problems.
Submission Number: 81
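
The core idea in the abstract — choosing the target update rate by its effect on training error rather than using a fixed rate — can be sketched as follows. This is a minimal tabular illustration under assumed details, not the paper's exact algorithm: the candidate rate set, the Polyak-averaging form, and the function names (`td_error`, `optimized_target_update`) are all illustrative.

```python
import numpy as np

def td_error(q, q_target, transitions, gamma=0.99):
    """Mean squared TD (training) error of q against bootstrapped targets
    computed from q_target, over a batch of tabular transitions."""
    s, a, r, s2 = transitions
    targets = r + gamma * q_target[s2].max(axis=1)
    return float(np.mean((q[s, a] - targets) ** 2))

def optimized_target_update(q, q_target, transitions,
                            rates=(0.0, 0.01, 0.1, 1.0)):
    """Pick the Polyak rate tau whose resulting target minimizes the
    training error, instead of always applying a fixed tau.
    tau = 0 early-stops the target (no update); tau = 1 is a hard copy.
    The candidate `rates` grid is an assumption for illustration."""
    best_tau, best_err, best_target = None, np.inf, None
    for tau in rates:
        candidate = (1 - tau) * q_target + tau * q  # Polyak-averaged target
        err = td_error(q, candidate, transitions)
        if err < best_err:  # keep the rate that lowers training error most
            best_tau, best_err, best_target = tau, err, candidate
    return best_tau, best_target
```

Note that `tau = 0` is always a candidate, so this rule never commits a target update that increases the training error — the failure mode the abstract attributes to standard DQN-style updates.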