Preventing Reward Hacking with Occupancy Measure Regularization

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: reward hacking, safety, occupancy measures, reinforcement learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: To prevent reward hacking in reinforcement learning, regularization based on occupancy measures is superior to regularization based on action distributions.
Abstract: Reward hacking occurs when an agent performs very well with respect to a specified or learned reward function (often called a "proxy"), but poorly with respect to the true desired reward function. Since ensuring good alignment between the proxy and the true reward is remarkably difficult, prior work has proposed regularizing to a "safe" policy using the KL divergence between action distributions. The challenge with this divergence measure is that a small change in action distribution at a single state can lead to potentially calamitous outcomes. Our insight is that when this happens, the state occupancy measure of the policy shifts significantly—the agent spends time in drastically different states than the safe policy does. We thus propose regularizing based on occupancy measure (OM) rather than action distribution. We show theoretically that there is a direct relationship between the returns of two policies under *any* reward function and their OM divergence, whereas no such relationship holds for their action distribution divergence. We then empirically find that OM regularization more effectively prevents reward hacking while allowing for performance improvement on top of the safe policy.
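For intuition about the theoretical claim: with the normalized state-action occupancy measure d_π(s, a) = (1 − γ) Σ_t γ^t Pr(s_t = s, a_t = a), the return of any reward R factors as J_R(π) = (1/(1 − γ)) Σ_{s,a} d_π(s, a) R(s, a), so |J_R(π) − J_R(π_safe)| ≤ (max|R| / (1 − γ)) · ||d_π − d_{π_safe}||_1 for every reward simultaneously; no analogous guarantee follows from a small per-state action-distribution KL, since one differing action can carry the agent to entirely different states. The sketch below makes both regularizers concrete in a small tabular MDP. It is illustrative only: the function names, the total-variation choice of OM divergence, and the tabular setting are assumptions for exposition, not the paper's implementation, which presumably approximates these quantities in non-tabular environments.

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma=0.99):
    """Discounted state-action occupancy measure of a tabular policy.

    P:   transitions, shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    pi:  policy, shape (S, A), rows sum to 1
    mu0: initial state distribution, shape (S,)
    Returns d of shape (S, A) with
    d[s, a] = (1 - gamma) * sum_t gamma^t Pr(s_t = s, a_t = a).
    """
    # State-to-state transition matrix induced by pi.
    P_pi = np.einsum('sa,sat->st', pi, P)
    # Solve the stationarity condition d_s = (1 - gamma) mu0 + gamma P_pi^T d_s.
    S = P.shape[0]
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d_s[:, None] * pi  # spread each state's mass over actions

def om_regularizer(P, pi, pi_safe, mu0, gamma=0.99):
    """Total-variation distance between the two policies' occupancy measures."""
    d = occupancy_measure(P, pi, mu0, gamma)
    d_safe = occupancy_measure(P, pi_safe, mu0, gamma)
    return 0.5 * np.abs(d - d_safe).sum()

def action_kl_regularizer(P, pi, pi_safe, mu0, gamma=0.99):
    """Baseline: per-state KL(pi || pi_safe), averaged over visited states."""
    d_s = occupancy_measure(P, pi, mu0, gamma).sum(axis=1)
    kl = (pi * (np.log(pi + 1e-12) - np.log(pi_safe + 1e-12))).sum(axis=1)
    return d_s @ kl

def regularized_objective(P, R, pi, pi_safe, mu0, gamma=0.99, lam=1.0):
    """Proxy return minus the OM penalty; the return is linear in the OM."""
    d = occupancy_measure(P, pi, mu0, gamma)
    proxy_return = (d * R).sum() / (1 - gamma)  # R has shape (S, A)
    return proxy_return - lam * om_regularizer(P, pi, pi_safe, mu0, gamma)
```

Because the return is linear in the occupancy measure, bounding the OM divergence directly bounds the return gap under every reward, including the unknown true one; the KL baseline controls only local action changes, which is why it can miss a single-state deviation with calamitous downstream consequences.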
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2096