Evaluating Oversight Robustness with Incentivized Reward Hacking

28 Sept 2024 (modified: 27 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: scalable oversight, robustness, reward hacking
Abstract: Scalable oversight aims to train systems to perform tasks that are hard for humans to specify, demonstrate, and validate. Because ground truth is not available for such tasks, evaluating scalable oversight techniques is challenging: existing methods measure the success of an oversight method by whether it allows an artificially weak overseer to successfully supervise an AI performing a task. In this work, we additionally measure the robustness of scalable oversight techniques by testing their vulnerability to reward hacking by an adversarial supervisee. In experiments on a synthetic domain, we show that adding an explicit reward hacking incentive to the model being trained leads it to exploit flaws in a weak overseer, and that scalable oversight methods designed to mitigate these flaws make the optimization more robust to reward hacking. We hope these experiments lay a foundation for future work to validate scalable oversight methods' ability to mitigate reward hacking in realistic settings.
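To make the evaluation protocol in the abstract concrete, the sketch below illustrates (with hypothetical function and class names, not the authors' actual code) how one might compare a policy trained normally against a policy trained with an explicit reward-hacking incentive, using the gap in ground-truth reward as a robustness measure. In the synthetic domain described above, the true reward is available for evaluation even though the overseer only sees a flawed proxy.

```python
# Illustrative sketch only; all names (evaluate, hacking_gap, OversightEvalResult)
# are hypothetical and not taken from the paper's implementation.
from dataclasses import dataclass
from typing import Callable, Sequence, Any


@dataclass
class OversightEvalResult:
    proxy_reward: float  # average reward assigned by the (flawed) weak overseer
    true_reward: float   # average ground-truth reward, available in the synthetic domain


def evaluate(
    policy: Callable[[Any], Any],
    overseer_reward: Callable[[Any, Any], float],
    true_reward: Callable[[Any, Any], float],
    tasks: Sequence[Any],
) -> OversightEvalResult:
    """Score a trained policy under both the weak overseer and ground truth."""
    outputs = [policy(t) for t in tasks]
    return OversightEvalResult(
        proxy_reward=sum(overseer_reward(t, o) for t, o in zip(tasks, outputs)) / len(tasks),
        true_reward=sum(true_reward(t, o) for t, o in zip(tasks, outputs)) / len(tasks),
    )


def hacking_gap(baseline: OversightEvalResult, incentivized: OversightEvalResult) -> float:
    """Robustness proxy: ground-truth performance lost when the supervisee is
    explicitly incentivized to reward hack. A robust oversight method keeps this
    gap small even if the incentivized policy's proxy reward remains high."""
    return baseline.true_reward - incentivized.true_reward
```

Under this framing, a large drop in true reward (with proxy reward held high) indicates that the incentivized model has found and exploited flaws in the weak overseer; oversight methods that shrink this gap are more robust.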
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12667