Inverse Linear Bandits via Linear Programs

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Inverse Reinforcement Learning, Linear Bandits
Abstract: Inverse reinforcement learning (IRL) is a well-established paradigm for circumventing the need for explicit reward specification. In this paper, we study the problem of estimating the reward function from a single sequence of actions (i.e., a demonstration) produced by a stochastic linear bandit algorithm. Our main result is a unified approach to inverse linear bandits, based on the idea of formulating a linear program by tightly characterizing the confidence intervals of the pulled actions. We show that the estimation error of our algorithms matches the information-theoretic lower bound, up to polynomial factors in $d$ and $\log T$, where $d$ is the dimensionality of the feature space and $T$ is the length of the demonstration. Compared to prior work, our approach (i) gives a unified reward estimator that works when the demonstrator employs either LinUCB or Phased Elimination, two popular algorithms for stochastic linear bandits, whereas the existing estimator only works for Phased Elimination; (ii) does not require access to the hyperparameters or internal state of the demonstrator algorithm, as prior work does; and (iii) works for general action sets, whereas the existing estimator requires assumptions on the density and geometry of the action set. We further demonstrate the practicality of our approach by validating our new algorithms on synthetic data and on demonstrations constructed from real-world datasets, where our estimators significantly outperform existing ones.
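
To convey the flavor of the LP-based construction described in the abstract, here is a minimal sketch in Python. It assumes a finite action set and an optimistic (LinUCB-style) demonstrator, so that each pull $x_t$ yields linear constraints $\langle \theta^*, y - x_t \rangle \le \beta(\|x_t\|_{V_t^{-1}} + \|y\|_{V_t^{-1}})$ on the unknown reward parameter. The function name `estimate_theta_lp`, the parameters `beta` and `lam`, the box bound on $\theta$, and the Chebyshev-center objective are all illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np
from scipy.optimize import linprog

def estimate_theta_lp(pulled, action_set, beta=1.0, lam=1.0):
    """LP-based reward estimation from a linear-bandit demonstration (sketch).

    pulled:     (T, d) array of demonstrated actions x_1, ..., x_T
    action_set: (K, d) array of available actions
    beta, lam:  assumed confidence radius and ridge parameter
    """
    T, d = pulled.shape
    V = lam * np.eye(d)                      # ridge-regularized design matrix
    A_rows, b_vals = [], []
    for t in range(T):
        x = pulled[t]
        V_inv = np.linalg.inv(V)
        w_x = np.sqrt(x @ V_inv @ x)         # confidence width of pulled action
        for y in action_set:
            if np.allclose(y, x):
                continue
            w_y = np.sqrt(y @ V_inv @ y)
            # Optimism of the demonstrator implies, w.h.p.,
            # <theta*, y - x> <= beta * (w_x + w_y): a linear constraint.
            A_rows.append(y - x)
            b_vals.append(beta * (w_x + w_y))
        V += np.outer(x, x)                  # design update after the pull
    A = np.asarray(A_rows)
    b = np.asarray(b_vals)

    # Return the Chebyshev center of {theta : A theta <= b} via one LP:
    # maximize r subject to A_i theta + ||A_i|| r <= b_i, with an assumed
    # box bound on theta to keep the program bounded.
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    c = np.zeros(d + 1)
    c[-1] = -1.0                             # linprog minimizes, so -r
    res = linprog(c, A_ub=np.hstack([A, norms]), b_ub=b,
                  bounds=[(-10.0, 10.0)] * d + [(0.0, None)])
    return res.x[:d] if res.success else None
```

The Chebyshev center is just one canonical way to select a single point from the feasible polytope of constraints; the paper's actual LP formulation, confidence radii, and guarantees may differ.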
Primary Area: learning theory
Submission Number: 24599