Keywords: Inverse Reinforcement Learning, Linear Bandits
TL;DR: We provide a technique and analysis to estimate the reward function of a learning agent in a Linear Stochastic Bandit problem.
Abstract: The paradigm of inverse reinforcement learning (IRL) is used to infer the reward function of an agent purely from its actions and is critical for value alignment and AI safety. While IRL is successful in practice, theoretical guarantees remain nascent. Motivated by the need for IRL in large action spaces with limited data, we consider as a first step the problem of learning from a single sequence of actions (i.e., a demonstration) of a stochastic linear bandit algorithm. When the demonstrator employs the Phased Elimination algorithm, we develop a simple inverse learning procedure that estimates the linear reward function consistently in the time horizon with just a \textit{single} demonstration. In particular, we show that our inverse learner approximates the true reward parameter within an error of $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega}})$ (where $T$ is the length of the demonstrator's trajectory and $\omega$ is a constant that depends on the geometry of the action set). We complement this result with an information-theoretic lower bound for any inverse learning procedure. We corroborate our theoretical results with simulations on synthetic data and a demonstration constructed from the MovieLens dataset.
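To make the setting concrete, here is a minimal, hypothetical sketch (not the paper's actual estimator) of how an inverse learner might recover a linear reward parameter from a single demonstration produced by a phased-elimination-style bandit learner: arms that survive longer before elimination are treated as higher-reward, and a parameter is fit to that ordering by least squares. The function name, the log-survival-time surrogate, and the toy data are all illustrative assumptions.

```python
# Illustrative sketch only: under a phased-elimination demonstrator, arms that
# keep being pulled in later phases should have larger reward a^T theta.
# We fit theta by least squares against log survival times as a stand-in for
# the inverse learning procedure described in the abstract.
import numpy as np

def estimate_reward_parameter(actions, demonstration):
    """actions: (K, d) array of arm feature vectors.
    demonstration: length-T sequence of arm indices chosen by the demonstrator.
    Returns a unit-norm estimate of the reward parameter theta."""
    K, d = actions.shape
    # Last round at which each arm is pulled (proxy for its elimination time);
    # arms never pulled get survival time 0.
    last_pull = np.zeros(K)
    for t, arm in enumerate(demonstration, start=1):
        last_pull[arm] = t
    scores = np.log1p(last_pull)  # monotone surrogate for the arm's reward
    theta_hat, *_ = np.linalg.lstsq(actions, scores, rcond=None)
    norm = np.linalg.norm(theta_hat)
    return theta_hat / norm if norm > 0 else theta_hat

# Toy usage: the best arm is pulled long after the others are eliminated.
if __name__ == "__main__":
    A = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
    demo = [0] * 50 + [1] * 50 + [2] * 50 + [2] * 200
    print(estimate_reward_parameter(A, demo))
```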
Supplementary Material: zip
List Of Authors: Guha, Etash and James, Jim and Acharya, Krishna and Muthukumar, Vidya and Pananjady, Ashwin
Latex Source Code: zip
Signed License Agreement: pdf
Code Url: https://github.com/jimtjames/InvPhasedElim
Submission Number: 550