Interpretable Reward Learning via Differentiable Decision TreesDownload PDF

Published: 05 Dec 2022, Last Modified: 05 May 2023MLSW2022Readers: Everyone
Abstract: There is an increasing interest in learning rewards and models of human intent from human feedback. However, many methods use blackbox learning methods that, while expressive, are hard to interpret. We propose a novel method for learning expressive and interpretable reward functions from preference feedback using differentiable decision trees. We test our algorithm on two test domains, demonstrating the ability to learn interpretable reward functions from both low- and high-dimensional visual state inputs. Furthermore, we provide preliminary evidence that the tree structure of our learned reward functions is useful in determining the extent to which a reward function is aligned with human preferences.
1 Reply