Why not both? Combining Bellman losses in deep reinforcement learning

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Deep reinforcement learning, Soft actor-critic, policy evaluation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose adding a Bellman residual auxiliary loss to fitted Q-evaluation and empirically demonstrate that the resulting policy evaluation becomes more robust to faster target-network update rates.
Abstract: Several deep reinforcement learning algorithms use a variant of fitted Q-evaluation for policy evaluation, alternating between estimating and regressing a target value function. In the linear function approximator case, fitted Q-evaluation is related to the projected Bellman error. A known alternative to the projected Bellman error is the Bellman residual, but the latter is known to give worse results in practice in the linear case and was recently shown to perform equally poorly with neural networks. While insufficient on its own, we show in this paper that the Bellman residual can be a useful auxiliary loss for neural fitted Q-evaluation. In fact, we show that existing auxiliary losses based on modelling the environment's reward and transition function can be seen as a combination of the Bellman residual and the projected Bellman error. Experimentally, we show that adding a Bellman residual loss stabilizes policy evaluation, allowing significantly more aggressive target network update rates. When applied to Soft Actor-Critic---a strong baseline for continuous control tasks---we show that the target's faster update rates yield improved sample efficiency on several MuJoCo tasks, whereas without the Bellman residual auxiliary loss, fitted Q-evaluation diverges in several of these instances.
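The abstract describes combining the standard fitted Q-evaluation loss (projected Bellman error with a frozen target network) with a Bellman residual auxiliary term. The submission's exact formulation is not given here; the following is a minimal, hypothetical PyTorch sketch of what such a combined critic loss could look like, where `beta` (the auxiliary weight), the function and argument names, and the batch layout are all illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combined_critic_loss(q_net, q_target_net, batch, gamma=0.99, beta=0.5):
    """Illustrative sketch (not the paper's code): fitted Q-evaluation loss
    plus a Bellman-residual auxiliary term weighted by an assumed coefficient
    `beta`."""
    s, a, r, s_next, a_next, done = batch

    q_sa = q_net(s, a)

    # Fitted Q-evaluation: regress toward a bootstrap target computed with the
    # frozen target network (no gradient through the next-state value).
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * q_target_net(s_next, a_next)
    fqe_loss = F.mse_loss(q_sa, td_target)

    # Bellman residual: the same temporal-difference error, but differentiated
    # through the online network's next-state estimate as well.
    residual = q_sa - (r + gamma * (1.0 - done) * q_net(s_next, a_next))
    br_loss = residual.pow(2).mean()

    return fqe_loss + beta * br_loss
```

In this sketch, the auxiliary residual term penalizes the online network's own temporal inconsistency, which is the mechanism the abstract credits with stabilizing policy evaluation under more aggressive target-network update rates.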
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3472