Keywords: Reinforcement Learning, Human-AI Alignment, Explainability, Reward Functions, Interpretability, RLHF, Visualization or interpretation of learned representations
TL;DR: Modeling reward functions as differentiable decision trees enables learning interpretable and expressive reward functions.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for capturing human intent, alleviating the challenges of hand-crafting reward values. Despite the increasing interest in RLHF, most works learn black-box reward functions that, while expressive, are difficult to interpret and often require running the entire costly RL process before we can even determine whether they are aligned with human preferences. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, visual gridworld environments, and Atari games, provide evidence that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences. We also provide experimental evidence that reward DDTs can often achieve RL performance competitive with larger-capacity deep neural network reward functions, and we demonstrate the diagnostic utility of our framework in checking the alignment of learned reward functions. We further observe that the choice between soft and hard (argmax) outputs of a reward DDT reveals a tension between wanting highly shaped rewards to ensure good RL performance and wanting simpler, more interpretable rewards. Videos and code are available at: https://sites.google.com/view/ddt-rlhf
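To make the idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a differentiable decision tree reward model with soft versus hard (argmax) routing, plus a Bradley-Terry style preference loss over trajectory segments. The class and function names (RewardDDT, preference_loss), the sigmoid-gated linear splits, the tree depth, and the PyTorch implementation details are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    class RewardDDT(nn.Module):
        """A soft (differentiable) decision tree mapping a state to a scalar reward.

        Each internal node applies a sigmoid gate on a linear projection of the
        input; each leaf holds a learnable scalar reward. The "soft" output is
        the routing-probability-weighted sum of leaf rewards; the "hard" output
        routes all probability mass to the single most likely leaf (argmax).
        """

        def __init__(self, input_dim: int, depth: int = 3):
            super().__init__()
            self.depth = depth
            self.num_internal = 2 ** depth - 1
            self.num_leaves = 2 ** depth
            self.gates = nn.Linear(input_dim, self.num_internal)  # one gate per internal node
            self.leaf_rewards = nn.Parameter(torch.zeros(self.num_leaves))

        def forward(self, x: torch.Tensor, hard: bool = False) -> torch.Tensor:
            # Probability of routing "right" at every internal node.
            p_right = torch.sigmoid(self.gates(x))                # (batch, num_internal)
            # Probability of reaching each leaf = product of gate decisions along its path.
            leaf_probs = x.new_ones(x.shape[0], 1)
            frontier = [0]                                        # breadth-first node indices
            for _ in range(self.depth):
                p = p_right[:, frontier]                          # (batch, len(frontier))
                leaf_probs = torch.stack([leaf_probs * (1 - p), leaf_probs * p], dim=-1)
                leaf_probs = leaf_probs.flatten(start_dim=1)
                frontier = [2 * i + c for i in frontier for c in (1, 2)]
            if hard:                                              # interpretable argmax routing
                leaf_probs = nn.functional.one_hot(
                    leaf_probs.argmax(dim=1), self.num_leaves).float()
            return leaf_probs @ self.leaf_rewards                 # (batch,) scalar rewards


    def preference_loss(model, traj_a, traj_b, label):
        """Bradley-Terry preference loss; label=1 means segment traj_b is preferred."""
        return_a = model(traj_a).sum()                            # summed predicted reward over segment a
        return_b = model(traj_b).sum()
        logits = torch.stack([return_a, return_b]).unsqueeze(0)   # (1, 2)
        return nn.functional.cross_entropy(logits, torch.tensor([label]))

Under these assumptions, training minimizes preference_loss over a dataset of labeled segment pairs with soft routing (so gradients flow through the gates), while the hard argmax mode can be used afterwards to read off a conventional, more interpretable decision-tree reward.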
Submission Number: 237