Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: deep reinforcement learning, off-policy deep reinforcement learning, constrained reinforcement learning, continuous action spaces, AI safety
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We diagnose why popular off-policy DRL algorithms learn poorly in problems with continuous action spaces and rewards composed of independent incentive and cost terms, and present a method that handles this setting effectively.
Abstract: Methods for off-policy deep reinforcement learning (DRL) offer improved sample efficiency relative to their on-policy counterparts, due to their ability to reuse data throughout the training process. For continuous action spaces, the most popular approaches to off-policy learning include policy improvement steps in which a learned state-action ($Q$) value function is maximized over selected batches of data. These updates are often paired with regularization to combat the associated overestimation of $Q$ values. With an eye toward safety, we revisit this strategy in environments with ``mixed-sign'' reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. In such environments, we find the combination of function approximation and a term that maximizes $Q$ in the policy update to be problematic, because systematic errors affect the magnitude of $Q$ estimates associated with reward terms of opposite signs asymmetrically. This results in overemphasis of either incentives or costs, which may severely limit learning. We explore two remedies to this issue. First, consistent with prior work, we find that periodic resetting of $Q$ and policy networks greatly reduces the error in $Q$ estimation and improves learning. Second, we formulate an off-policy actor-critic that does not include a $Q$ maximization term in the policy improvement step. This method builds on prior approaches with similar policy optimization steps, improving their scalability, removing the need for resetting, and supporting constrained learning when required. We find that our approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting. We further explore the applicability of our approach to more frequently studied control problems that do not have mixed-sign rewards, finding it to perform competitively and with favorable replay-ratio scaling properties.
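For concreteness, below is a minimal sketch, not drawn from the submission itself, of the standard $Q$-maximizing policy improvement step (DDPG/TD3-style) that the abstract critiques, applied to a toy mixed-sign reward. All network sizes, names, and the specific incentive and cost terms are hypothetical illustrations.

```python
# Hedged sketch: a DDPG/TD3-style policy update that ascends a learned Q over
# a batch, together with a toy "mixed-sign" reward (incentive minus cost).
# Everything here is illustrative; it is not the authors' proposed method.
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

def mixed_sign_reward(state, action):
    # Independent positive (incentive) and negative (cost) terms, as in the abstract.
    incentive = torch.linalg.norm(state, dim=-1)   # hypothetical progress-style incentive
    cost = (action ** 2).sum(dim=-1)                # hypothetical energy/safety cost
    return incentive - cost

q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(256, state_dim)               # stand-in for a replay-buffer batch
actions = policy(states)
rewards = mixed_sign_reward(states, actions.detach())  # would drive the Q regression in practice

# Policy improvement by maximizing the learned Q over the batch: asymmetric Q errors
# on the incentive vs. cost terms are amplified by this maximization step.
actor_loss = -q_net(torch.cat([states, actions], dim=-1)).mean()
policy_opt.zero_grad()
actor_loss.backward()
policy_opt.step()
```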
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6117