Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Thomas Kwa; Drake Thomas; Adrià Garriga-Alonso

Published: 28 Jun 2024, Last Modified: 25 Jul 2024

Venue: NextGenAISafety 2024 Poster

License: CC BY 4.0

Keywords: RLHF, extreme value theory, reward misspecification

TL;DR: KL regularization doesn't guarantee good outcomes in RLHF.

Abstract: When applying reinforcement learning from human feedback (RLHF), the reward is learned from data, and therefore always has some error. It is common to mitigate this by regularizing the policy by KL divergence from a base model, with the hope that by balancing reward with regularization we can achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, the optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model—a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method developed for adversarial attacks to measure the tails of open-source reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.
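Illustration: the contrast between light-tailed and heavy-tailed error can be seen in a toy simulation. This is a minimal sketch, not the authors' method. It assumes that reward is utility plus independent error, that utility is standard normal, and that keeping the top-k samples by reward is a crude stand-in for optimizing a policy against the reward. The function names and the Pareto tail index of 1.5 are hypothetical choices made for this example.

```python
import random


def mean_utility_of_top(sample_error, n=100_000, top_k=100, seed=0):
    """Draw n samples, rank them by reward, and return the mean utility of the top_k.

    Assumption for this sketch: reward = utility + error, with the two independent.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        utility = rng.gauss(0.0, 1.0)   # true utility of the sample
        error = sample_error(rng)       # reward-model error
        samples.append((utility + error, utility))
    # Selecting on reward is the stand-in for optimization pressure.
    samples.sort(key=lambda pair: pair[0], reverse=True)
    top = samples[:top_k]
    return sum(utility for _, utility in top) / top_k


def light_tailed_error(rng):
    """Gaussian error: tails decay faster than exponentially."""
    return rng.gauss(0.0, 1.0)


def heavy_tailed_error(rng):
    """Symmetric Pareto error with tail index 1.5 (a hypothetical choice)."""
    magnitude = rng.paretovariate(1.5)
    return magnitude if rng.random() < 0.5 else -magnitude


light = mean_utility_of_top(light_tailed_error)
heavy = mean_utility_of_top(heavy_tailed_error)
print(f"mean utility of top samples, light-tailed error: {light:.2f}")
print(f"mean utility of top samples, heavy-tailed error: {heavy:.2f}")
```

Under these assumptions, selecting on reward raises utility well above the base mean of zero when the error is Gaussian. With Pareto error, the highest-reward samples are dominated by the error term, so their utility stays close to the base mean.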

Submission Number: 121
