Keywords: imitation, inverse RL, learning from demonstration
Abstract: We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (\texttt{SPOIL}), which is guaranteed to match the performance of any expert up to an additive error $\epsilon$ with access to $\mathcal{O}(\epsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\epsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that a neural-network implementation of \texttt{SPOIL} is superior to behavior cloning and competitive with state-of-the-art algorithms.
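The abstract does not spell out the \texttt{SPOIL} objective itself, so the following is only a minimal sketch of what a saddle-point (min-max) imitation update over expert state-action pairs can look like in PyTorch. The discrete-action setup, the \texttt{MLP} architecture, the value-gap objective, and all names (\texttt{saddle\_point\_step}, learning rates, dimensions) are illustrative assumptions, not the paper's actual loss or algorithm.

```python
# Illustrative sketch of an alternating min-max (saddle-point) imitation
# update. NOTE: this is an assumed generic objective, not the SPOIL loss.
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Small two-layer network used for both the critic and the policy."""

    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)


def saddle_point_step(critic, policy, states, expert_actions,
                      critic_opt, policy_opt):
    """One alternating update on a batch of expert (state, action) pairs."""
    # Critic ascent: widen the gap between the value assigned to the
    # expert's action and the value expected under the current policy.
    pi = torch.softmax(policy(states), dim=-1).detach()   # (B, A), fixed
    q = critic(states)                                    # (B, A)
    q_expert = q.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    q_policy = (pi * q).sum(dim=-1)
    critic_loss = -(q_expert - q_policy).mean()           # maximize the gap
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy descent: move probability mass toward actions the (now fixed)
    # critic rates highly, shrinking the gap from the other side.
    q = critic(states).detach()
    pi = torch.softmax(policy(states), dim=-1)
    policy_loss = -(pi * q).sum(dim=-1).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()


# Hypothetical usage: 8-dimensional states, 4 discrete actions.
states = torch.randn(64, 8)
expert_actions = torch.randint(0, 4, (64,))
critic, policy = MLP(8, 4), MLP(8, 4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
saddle_point_step(critic, policy, states, expert_actions,
                  critic_opt, policy_opt)
```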
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Luca_Viano1
Track: Regular Track: unpublished work
Submission Number: 55