Keywords: AI Safety, Human-AI Interaction, Reinforcement Learning, Online Learning
Abstract: For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since explicitly specifying preferences is difficult, inverse reinforcement learning (IRL) aims to develop methods for inferring preferences from observed behaviour. However, IRL assumes the human to be approximately optimal, a significant limitation when the human is themselves still learning to act optimally in the environment. In this paper, we formalize the problem of *learning the preferences of a learning agent*: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner either as a no-regret learner, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning algorithms, or otherwise show that such guarantees are impossible.
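To make the second learner model concrete, here is a minimal illustrative sketch (not taken from the paper) of a Boltzmann policy: a softmax over state-action values, where a growing inverse temperature `beta_t` (a hypothetical schedule, introduced here only for illustration) models a learner that starts suboptimal and converges toward the optimal action.

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray, beta: float) -> np.ndarray:
    """Action distribution proportional to exp(beta * Q(s, a))."""
    logits = beta * q_values
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# As beta_t grows over time, the policy concentrates on argmax_a Q(s, a),
# modelling a learner that is initially suboptimal but improves.
q = np.array([1.0, 0.5, 0.2])  # hypothetical Q-values for one state
for beta_t in (0.1, 1.0, 10.0):
    print(beta_t, boltzmann_policy(q, beta_t))
```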
Submission Number: 76