Robustness in the Face of Partial Identifiability in Reward Learning Problems

02 Feb 2026 (modified: 14 Apr 2026). Submitted to AFAA 2026. License: CC BY 4.0
Track: Main Papers Track (6 to 9 pages)
Keywords: Inverse Reinforcement Learning, Reward Learning, Preference Based Reinforcement Learning, Theory, Alignment
TL;DR: We propose to tackle the identifiability problem in reward learning with a robust approach.
Abstract: Reward Learning (ReL) refers to a category of problems, including Reinforcement Learning from Human Feedback, in which the goal is to use some form of human feedback to align AI models. More formally, in ReL, we are given feedback on an unknown target reward, and the goal is to use this information to recover the reward in order to carry out some downstream application, e.g., planning. When the feedback is not informative enough, the target reward is only partially identifiable, i.e., there exists a set of rewards, called the feasible set, that are equally plausible candidates for the target reward. In these cases, the ReL algorithm might recover a reward function different from the target reward, possibly leading to a failure in the application. In this paper, we introduce a general ReL framework that makes it possible to quantify the drop in "performance" suffered in the considered application because of identifiability issues. Building on this, we propose a robust approach that addresses the identifiability problem in a principled way by maximizing the "performance" with respect to the worst-case reward in the feasible set. We then develop Rob-ReL, a ReL algorithm that applies this robust approach to the subset of ReL problems aimed at assessing a preference between two policies, and we provide theoretical guarantees on the sample and iteration complexity of Rob-ReL. We conclude with numerical simulations that illustrate the setting and empirically characterize Rob-ReL.
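The worst-case idea in the abstract can be illustrated with a toy sketch. This is not Rob-ReL and not the paper's formulation: it assumes, purely for illustration, a finite feasible set of reward vectors, policies summarized by hypothetical state-action occupancy vectors, and "performance" taken to be the expected return (the inner product of occupancy and reward). All names and numbers below are invented for the example.

```python
# Toy max-min selection over a finite feasible set (illustrative assumption,
# not the paper's algorithm).

def expected_return(occupancy, reward):
    """Inner product of an occupancy vector and a reward vector."""
    return sum(d * r for d, r in zip(occupancy, reward))

def worst_case_return(occupancy, feasible_rewards):
    """Smallest expected return over all candidate rewards."""
    return min(expected_return(occupancy, r) for r in feasible_rewards)

def robust_choice(occupancies, feasible_rewards):
    """Index of the policy with the largest worst-case return, plus all values."""
    values = [worst_case_return(d, feasible_rewards) for d in occupancies]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values

# Hypothetical data: 3 state-action pairs, 2 candidate rewards, 2 policies.
feasible_rewards = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]
occupancies = [
    [0.9, 0.1, 0.0],  # policy 0: excellent under the first reward only
    [0.2, 0.2, 0.6],  # policy 1: moderate under both rewards
]

best, values = robust_choice(occupancies, feasible_rewards)
print(best, values)
```

In this toy instance the first policy does very well under one candidate reward and poorly under the other, so the max-min criterion prefers the second policy, whose return is the same under both candidates.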
Submission Number: 27