Goodhart's Law in Reinforcement Learning

Published: 16 Jan 2024 · Last Modified: 15 Mar 2024 · ICLR 2024 poster
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: reinforcement learning, Goodhart's law, misspecification, reward learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We study Goodhart's law in RL empirically, provide a theoretical explanation for why it occurs, and use these theoretical insights to derive two methods for avoiding Goodharting.
Abstract: Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a *proxy* for the true objective rather than as its definition. We study this phenomenon through the lens of *Goodhart’s law*, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to *quantify* the magnitude of this effect and *show empirically* that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart’s law for a wide range of environments and reward functions. We then provide a *geometric explanation* for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an *optimal early stopping method* that provably avoids the aforementioned pitfall and derive theoretical *regret bounds* for this method. Moreover, we derive a training method that maximises worst-case reward for settings in which the true reward function is uncertain. Finally, we evaluate our early stopping method experimentally. Our results provide a foundation for a theoretically principled study of reinforcement learning under reward misspecification.
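To make the abstract's central claim concrete, below is a minimal toy sketch of the Goodhart effect, not the paper's method or experiments: the proxy reward is assumed to be the true reward plus misspecification noise, and a policy over actions is optimised against the proxy with increasing optimisation pressure (modelled here as a softmax temperature). All names, distributions, and parameter values are illustrative assumptions.

```python
import numpy as np

# Toy illustration (not the paper's method): Goodharting as over-optimisation
# of a proxy reward that is only approximately aligned with the true reward.
# A policy is a distribution over K actions; "optimisation pressure" is the
# inverse temperature beta of a softmax applied to the proxy reward.

rng = np.random.default_rng(0)
K = 50
true_reward = rng.normal(size=K)
# Assumed misspecification model: proxy = true reward + noise.
proxy_reward = true_reward + 0.5 * rng.normal(size=K)

def softmax_policy(reward, beta):
    # Boltzmann policy over actions; beta = 0 gives the uniform policy.
    z = beta * (reward - reward.max())
    p = np.exp(z)
    return p / p.sum()

print(f"{'beta':>6} {'proxy value':>12} {'true value':>11}")
for beta in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 64.0]:
    pi = softmax_policy(proxy_reward, beta)
    # Expected proxy and true reward under the proxy-optimised policy.
    print(f"{beta:6.1f} {pi @ proxy_reward:12.3f} {pi @ true_reward:11.3f}")

# Expected proxy value increases monotonically in beta, while expected true
# value typically rises at first and then falls or plateaus below its optimum:
# the over-optimisation pattern the abstract refers to as Goodhart's law.
```

The exact shape of the true-value curve depends on the random seed and the noise scale; the paper's contribution is to characterise when and why this drop occurs in Markov decision processes and how to stop optimisation before it happens.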
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Primary Area: reinforcement learning
Submission Number: 6832