Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback

Published: 01 May 2025 · Last Modified: 24 Jul 2025 · ICML 2025 poster · CC BY 4.0
Abstract: The success of sequential decision-making approaches, such as *reinforcement learning* (RL), is closely tied to the availability of reward feedback. However, designing a reward function that encodes the desired objective is a challenging task. In this work, we address a more realistic scenario: sequential decision making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking *preferences*, (non-Markovian) *utilities*, and (Markovian) *rewards*, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories, enabling the presence of incomparabilities that are common when preferences are provided by humans but are surprisingly overlooked in existing works. Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is NP-hard. Third, we propose a novel concept of preference-based policy dominance that does not rely on utilities or rewards and discuss the computational complexity of assessing it. Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision making from preference feedback, with promising potential applications in RL from human feedback.
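As a rough illustration of the preference-utility compatibility mentioned above, one plausible formalization (our own reading, not quoted from the paper) asks a $d$-dimensional utility $U : \mathcal{T} \to \mathbb{R}^d$ over trajectories to reproduce the partial (pre)order through componentwise (Pareto) dominance:

$$\tau \succeq \tau' \;\iff\; U_i(\tau) \ge U_i(\tau') \;\text{ for all } i \in \{1,\dots,d\},$$

so that incomparable trajectories map to mutually non-dominating utility vectors. Under this reading, finding the smallest dimension $d$ admitting such an embedding corresponds to a minimum-dimension order embedding, consistent with the NP-hardness result stated in the abstract.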
Lay Summary: The success of sequential decision-making approaches, such as Reinforcement Learning, is closely tied to the availability of a numerical reward function, which must capture the desired agent behavior. However, in complex real-world applications, defining such a reward function is challenging. As a result, an alternative, more realistic form of feedback has emerged in the literature: preferences among trajectories. Such preferences can be provided, e.g., by a human expert who knows what goal the learning process aims to achieve. In this paper, we study the link between preference feedback and numerical reward functions, using trajectory utilities as an intermediate step. We model preferences as partial orders to capture the multi-dimensionality that can arise in real-world problems. From this model, we define the concepts of dominance and optimality in terms of behavior, i.e., policies, and discuss their computational properties. Moreover, we study whether a reward function can be recovered from the observed preferences, define a method to approximate it when exact recovery is not possible, and provide a bound on the performance difference that such an approximation introduces.
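As a small concrete illustration (our own sketch, not the authors' algorithm or code), checking whether a candidate multi-objective utility agrees with a set of observed strict preferences can be done with a simple Pareto-dominance test; the trajectory ids, utility vectors, and helper names below are hypothetical:

```python
import numpy as np

def pareto_dominates(u, v):
    """True if vector u weakly dominates v componentwise, with at least one strict improvement."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u >= v) and np.any(u > v))

def is_compatible(utility, strict_prefs):
    """utility: dict mapping trajectory id -> multi-objective utility vector.
    strict_prefs: iterable of (a, b) pairs meaning 'a is strictly preferred to b'.
    Returns True if every strict preference is reflected by Pareto dominance."""
    return all(pareto_dominates(utility[a], utility[b]) for a, b in strict_prefs)

# Toy example with two objectives: t1 is preferred to t0, while t1 and t2
# are incomparable because neither utility vector dominates the other.
utility = {"t0": [1.0, 0.5], "t1": [2.0, 0.5], "t2": [0.5, 2.0]}
prefs = [("t1", "t0")]
print(is_compatible(utility, prefs))  # True under this sketch
```

In this toy example, the incomparability between t1 and t2 mirrors the kind of incomparabilities the paper allows in the preference relation.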
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: preference-based, sequential decision making, preference feedback
Submission Number: 11786