Learning Reward Machines from Partially Observed Policies

TMLR Paper 4886 Authors

19 May 2025 (modified: 23 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or from demonstrations by an expert. In this work, we assume that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the states of a Markov Decision Process (MDP). Our goal is to identify the true reward machine from finite information. To this end, we first introduce the notion of a prefix tree policy, which associates a distribution over actions with each state of the MDP and each attainable finite sequence of atomic propositions. We then characterize the equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. We prove that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to this equivalence class. The sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, demonstrate the effectiveness and generality of the approach.
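To make the objects named in the abstract concrete, the following is a minimal Python sketch of a reward machine and a prefix tree policy. It assumes the standard Mealy-machine formulation of reward machines (transitions and rewards driven by sets of atomic propositions); all names and type choices here are illustrative and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# A label is the set of atomic propositions observed at one step;
# a prefix is a finite sequence of labels (hypothetical encoding).
Label = FrozenSet[str]
LabelSeq = Tuple[Label, ...]


@dataclass
class RewardMachine:
    """Mealy-style reward machine: state transitions and rewards depend on labels."""
    states: set
    initial: int
    delta: Dict[Tuple[int, Label], int]      # (RM state, label) -> next RM state
    reward: Dict[Tuple[int, Label], float]   # (RM state, label) -> scalar reward

    def run(self, labels: LabelSeq) -> int:
        """Return the RM state reached after reading a label sequence."""
        u = self.initial
        for lab in labels:
            u = self.delta[(u, lab)]
        return u


# A prefix tree policy maps an MDP state together with the attainable label
# sequence observed so far to a distribution over actions.
PrefixTreePolicy = Dict[Tuple[int, LabelSeq], Dict[str, float]]
```

Under this reading, the identification problem is to recover `delta` and `reward` (up to the equivalence class discussed in the paper) from a `PrefixTreePolicy` known only up to a finite depth.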
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jean_Honorio1
Submission Number: 4886