Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective.

Published: 19 Dec 2025, Last Modified: 05 Jan 2026, AAMAS 2026 Full, CC BY 4.0
Keywords: Inverse constrained reinforcement learning, ICRL, Q learning, SQIL
TL;DR: Inverse Constrained Reinforcement Learning with Q-values, mixing task rewards and safety, in continuous and stochastic settings.
Abstract: Given a set of trajectories demonstrating the safe execution of a task in a constrained MDP with observable rewards but unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories without (a) significantly increasing the likelihood of trajectories that yield high cumulative rewards but contain potentially unsafe steps, or (b) being too conservative. With this objective, we aim to learn a policy that maximizes the probability of the most promising trajectories with respect to the demonstrations. In so doing, we formulate the promise of individual state-action pairs in terms of Q-values, which depend on task-specific rewards as well as on assessments of the safety of steps. This entails a safe Q-learning perspective on the imitation learning problem under constraints: the devised Safe Q Inverse Constrained Reinforcement Learning (SQIL) algorithm is compared to state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, showing its strengths. The contributions of this article are as follows: (a) It formulates the problem of learning a constraint-abiding policy with respect to expert-demonstrated trajectories as an inverse constrained reinforcement learning problem whose objective function is specified in terms of Q-values of trajectory steps that incorporate assessments of the safety of states, mixing expectations in terms of rewards and safety. (b) It proposes the Safe Q Inverse Constrained Reinforcement Learning (SQIL) algorithm. (c) It presents evaluation results for SQIL in settings with constraints of increasing complexity, compared to results from state-of-the-art imitation and inverse constrained reinforcement learning algorithms.
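To make the stated objective concrete, the sketch below illustrates one plausible reading of it: a softmax policy whose logits mix a task-reward Q-function with a per-action safety score, trained by maximizing the likelihood of demonstrated state-action pairs. This is not the authors' SQIL implementation; the names (QNetwork, demo_nll, safety_weight) and the use of discrete actions are assumptions made purely for illustration.

```python
# Minimal sketch (assumed, not the paper's algorithm): a Boltzmann policy over
# Q-values that mix task rewards and a safety assessment, fit by maximizing the
# likelihood of expert state-action pairs. All names are hypothetical.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a state to one score per discrete action (reward Q-value or safety cost)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)


def demo_nll(q_reward: QNetwork, safety_cost: QNetwork, states: torch.Tensor,
             actions: torch.Tensor, safety_weight: float = 1.0) -> torch.Tensor:
    """Negative log-likelihood of demonstrated actions under a softmax policy
    whose logits combine task-reward Q-values with a penalty for unsafe actions."""
    mixed_q = q_reward(states) - safety_weight * safety_cost(states)
    log_pi = torch.log_softmax(mixed_q, dim=-1)          # Boltzmann policy over mixed Q
    return -log_pi.gather(1, actions.unsqueeze(1)).mean()


if __name__ == "__main__":
    # Usage sketch on random stand-in data; real demonstrations would come from the expert.
    state_dim, n_actions = 8, 4
    q_r, c_hat = QNetwork(state_dim, n_actions), QNetwork(state_dim, n_actions)
    opt = torch.optim.Adam(list(q_r.parameters()) + list(c_hat.parameters()), lr=1e-3)
    states = torch.randn(32, state_dim)
    actions = torch.randint(0, n_actions, (32,))
    loss = demo_nll(q_r, c_hat, states, actions)
    loss.backward()
    opt.step()
```

In this reading, raising safety_weight makes the imitating policy more conservative, while setting it to zero recovers plain likelihood maximization over reward-driven Q-values; the paper's actual objective and safety estimator may differ.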
Area: Learning and Adaptation (LEARN)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 990