A general framework for reward function distances

Erik Jenner; Joar Max Viktor Skalse; Adam Gleave

Published: 05 Dec 2022, Last Modified: 05 May 2023 | MLSW2022 | Readers: Everyone

Abstract: In reward learning, it is helpful to be able to measure distances between reward functions, for example to evaluate learned reward models. Using simple metrics such as L^2 distances is not ideal because reward functions that are equivalent in terms of their optimal policies can nevertheless have high L^2 distance. EPIC and DARD are distances specifically designed for reward functions that address this by being invariant under certain transformations that leave optimal policies unchanged. However, EPIC and DARD are designed in an ad hoc manner, only consider a subset of relevant reward transformations, and suffer from serious pathologies in some settings. In this paper, we define a general class of reward function distance metrics, of which EPIC is a special case. This framework lets us address all these issues with EPIC and DARD, and allows for the development of reward function distance metrics in a more principled manner.
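
The motivating claim, that policy-equivalent reward functions can be far apart in L^2 distance, can be illustrated with a minimal sketch. This is not code from the paper: the three-state deterministic MDP, the reward table `R`, the scale factor, and the potential values are all hypothetical choices made for illustration. The sketch applies positive rescaling and potential-based shaping (two standard transformations known to preserve optimal policies), then compares the optimal policies and the L^2 distance between the two reward tables.

```python
import math

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2


def next_state(s, a):
    # Hypothetical deterministic dynamics:
    # action 0 moves to the next state (cyclically), action 1 stays put.
    return (s + 1) % N_STATES if a == 0 else s


# Toy reward table R[s][a]; the values are assumptions chosen for illustration.
R = [[0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]


def shape(reward, scale, potential):
    # Positive rescaling plus potential-based shaping:
    # R'(s, a) = scale * R(s, a) + gamma * phi(s') - phi(s)
    return [
        [
            scale * reward[s][a]
            + GAMMA * potential[next_state(s, a)]
            - potential[s]
            for a in range(N_ACTIONS)
        ]
        for s in range(N_STATES)
    ]


def optimal_policy(reward, iters=500):
    # Plain value iteration followed by a greedy policy readout.
    v = [0.0] * N_STATES
    for _ in range(iters):
        v = [
            max(reward[s][a] + GAMMA * v[next_state(s, a)] for a in range(N_ACTIONS))
            for s in range(N_STATES)
        ]
    return [
        max(
            range(N_ACTIONS),
            key=lambda a: reward[s][a] + GAMMA * v[next_state(s, a)],
        )
        for s in range(N_STATES)
    ]


def l2_distance(r1, r2):
    # Unweighted L^2 distance between two reward tables.
    return math.sqrt(
        sum(
            (r1[s][a] - r2[s][a]) ** 2
            for s in range(N_STATES)
            for a in range(N_ACTIONS)
        )
    )


R_shaped = shape(R, scale=2.0, potential=[0.0, 5.0, -3.0])
policy = optimal_policy(R)
policy_shaped = optimal_policy(R_shaped)
distance = l2_distance(R, R_shaped)

print("optimal policy (original):", policy)
print("optimal policy (shaped):  ", policy_shaped)
print("L2 distance between rewards:", distance)
```

In this toy setting the two reward tables induce the same optimal policy, yet their L^2 distance is well above zero, which is the kind of mismatch that invariant distances such as EPIC and DARD are designed to avoid.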
