The Consequences of the Intrinsic Gap Between Reward Beliefs and MDP Rewards

ICLR 2026 Conference Submission 19684 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · CC BY 4.0
Keywords: reward beliefs, intrinsic reward gap
Abstract: Deep neural policies can now learn and execute sequences of decisions in MDPs with complex, high-dimensional states. With the growing use of reinforcement learning in diverse fields, from language agents to medicine and finance, a line of research has focused on constructing reward functions by observing how an optimal policy behaves, on the premise that this will yield policies aligned with the intended outcome. Within this line of research, several studies have proposed algorithms that learn a reward function or an optimal policy from observed optimal trajectories, with the goal of achieving sample-efficient, robust, and aligned policies. In this paper, we analyze the implications of learning with reward beliefs in MDPs with high-dimensional state representations, and we demonstrate that standard deep reinforcement learning yields more resilient and value-aligned policies than learning from the behaviour of other policies in MDPs with complex state representations.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19684