Keywords: Reward learning, RLHF, RL, Safety, Distributional shift, Generalization, Learning Theory
TL;DR: We study the relationship between the expected error of a reward model on some data distribution and the extent to which optimizing that reward model is guaranteed to produce a policy with low regret according to the true reward function.
Abstract: In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a reward model may achieve low loss on the training distribution and yet, when optimized against, produce a policy with large regret under the true reward function. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distribution shift that commonly occurs during policy optimization. In this paper, we show mathematically that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions under which error-regret mismatch can occur. We then show that similar problems persist even with the policy regularization techniques commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods for learning reward models, and of better ways to reliably measure their quality.
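As a toy illustration of the error-regret mismatch described above, consider a two-armed bandit in which the evaluation distribution rarely covers the action that the learned reward model overestimates. This sketch is not from the submission itself; the action names, rewards, and probabilities below are hypothetical, chosen only to make the phenomenon concrete.

```python
# Toy illustration of error-regret mismatch in a two-armed bandit.
# All names and numbers are hypothetical, not from the paper.

true_reward = {"a": 1.0, "b": 0.0}      # ground-truth reward function
learned_reward = {"a": 1.0, "b": 2.0}   # reward model: wrong only on "b"

# Data distribution used to evaluate the reward model: it almost
# never samples action "b", so the model's error there barely counts.
data_dist = {"a": 0.99, "b": 0.01}

expected_error = sum(
    p * abs(learned_reward[x] - true_reward[x]) for x, p in data_dist.items()
)
print(f"expected test error: {expected_error:.2f}")  # 0.02 -- low

# Optimizing against the learned reward model selects "b" ...
policy_action = max(learned_reward, key=learned_reward.get)

# ... which is maximally bad under the true reward function.
regret = max(true_reward.values()) - true_reward[policy_action]
print(f"regret of optimized policy: {regret:.2f}")  # 1.00 -- maximal
```

The mismatch arises because the distribution induced by optimizing the learned reward (all mass on "b") differs from the distribution on which the model's error was measured, which is exactly the distribution shift the abstract points to.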
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7701