Abstract: Value-based offline RL methods are prone to overestimating the values of out-of-distribution (OOD) actions, and this is often addressed by regularizing the action-value function in the Bellman update. However, existing regularization methods can be overly conservative, often because they over-penalize the values of in-distribution actions as well as out-of-support actions. We present a new regularization method for offline value-based methods, called Density-Scaled (DS) regularization, which penalizes the value function based on the relative action density of the behavior policy. We establish a theoretical connection between our method and the existing Supported Value Regularization (SVR) method, showing that the SVR policy-evaluation solution can be viewed as a limiting case of the solution to the DS-regularized problem. Empirical results demonstrate that the DS penalty is competitive with state-of-the-art techniques, more robust to misestimation of the behavior density than SVR, and offers greater flexibility in the hyperparameters associated with learning the behavior policy.
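The abstract describes penalizing Q-values in proportion to how unlikely each action is under the (estimated) behavior policy. The sketch below is not the authors' implementation; it is a minimal illustration of that idea in PyTorch, assuming hypothetical components `q_net`, `target_q_net`, `policy`, and a learned density estimator `behavior_log_prob`, and an illustrative inverse-density weighting for the penalty.

```python
# Minimal sketch of a density-scaled value penalty added to a standard TD loss.
# All networks and the behavior-density estimator are assumed/hypothetical.
import torch
import torch.nn.functional as F

def ds_regularized_critic_loss(q_net, target_q_net, policy, behavior_log_prob,
                               batch, alpha=1.0, gamma=0.99, n_samples=10):
    """TD loss plus a penalty on Q-values weighted by how unlikely each
    sampled action is under the estimated behavior policy (illustrative form)."""
    s, a, r, s_next, done = batch  # states, actions, rewards, next states, done flags

    # --- Ordinary Bellman backup with a target network ---
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
    td_loss = F.mse_loss(q_net(s, a), target)

    # --- Density-scaled penalty (illustrative form, not the paper's exact objective) ---
    # Sample candidate actions and weight each Q-value by a factor that grows as its
    # estimated behavior density shrinks, so OOD actions are pushed down harder
    # than in-distribution ones.
    a_samp = torch.rand(n_samples, *a.shape, device=a.device) * 2.0 - 1.0  # uniform in [-1, 1]
    s_rep = s.unsqueeze(0).expand(n_samples, *s.shape)
    with torch.no_grad():
        log_mu = behavior_log_prob(s_rep, a_samp)   # (n_samples, batch): estimated log-density
        w = torch.softmax(-log_mu, dim=0)           # relative inverse-density weights per state
    q_samp = q_net(s_rep, a_samp).reshape(n_samples, -1)  # (n_samples, batch) Q-values
    penalty = (w * q_samp).sum(dim=0).mean()

    return td_loss + alpha * penalty
```

In this sketch, in-distribution actions receive small weights and contribute little to the penalty, while low-density actions dominate it, which matches the abstract's claim of avoiding over-penalization of in-distribution actions; the exact form of the density scaling in the paper may differ.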
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yue_Wang16
Submission Number: 8353