The Minimax Complexity of Preference-Based Decision Making in Multi-Objective Reinforcement Learning

Published: 23 Sept 2025, Last Modified: 01 Dec 2025 · ARLET · CC BY 4.0
Track: Research Track
Keywords: Multi-Objective Reinforcement Learning, Preference-Based Learning, Minimax Regret, Sample Complexity
Abstract: We study the fundamental decision-theoretic limits of preference-based learning in multi-objective reinforcement learning (MO-RL). Unlike prior work that focuses on recovering latent reward representations, we frame the problem directly in terms of minimizing decision regret: selecting policies that align with an unknown utility function over vector-valued rewards using only pairwise preference queries. We introduce a minimax framework for analyzing the worst-case sample complexity of preference-based policy selection in MO-RL and derive tight lower bounds on regret that depend on the dimensionality, curvature, and separation of the Pareto front. To complement these bounds, we propose a query-efficient algorithm whose upper bounds match the lower bounds under mild smoothness and noise assumptions. Our results show that, even without recovering the underlying reward functions, optimal policy selection is possible at a fundamental rate that tightly characterizes the hardness of multi-objective preference learning. This work highlights a gap between reward recovery and regret minimization in human-aligned decision-making, and provides a theoretical foundation for regret-optimal preference-based learning systems.
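To make the setting concrete, the sketch below illustrates preference-based policy selection from pairwise queries under assumptions not stated in the abstract: a finite candidate set of policies summarized by their expected vector returns, a hidden linear scalarization `w_star`, Bradley-Terry (logistic) preference noise, and a simple successive-elimination rule. It is an illustrative stand-in, not the paper's algorithm or its matching-rate analysis.

```python
import numpy as np


def preference_oracle(v_i, v_j, w_star, rng):
    """Noisy pairwise preference: 1 if policy i is preferred to policy j.

    Assumes a linear latent utility u(v) = w_star . v and a
    Bradley-Terry / logistic noise model (illustrative assumption).
    """
    diff = float(w_star @ (v_i - v_j))
    p_i_wins = 1.0 / (1.0 + np.exp(-diff))
    return int(rng.random() < p_i_wins)


def select_policy(returns, w_star, rng, rounds=200, delta=0.05):
    """Successive elimination over candidate policies via pairwise queries.

    returns: (n, d) array of expected vector-valued returns, one row per policy.
    A policy is eliminated once another surviving policy is preferred to it
    with confidence 1 - delta (Hoeffding radius on the empirical win rate).
    """
    n = returns.shape[0]
    alive = list(range(n))
    wins = np.zeros((n, n))
    counts = np.zeros((n, n))

    for _ in range(rounds):
        # Query each surviving pair once per round.
        for a, i in enumerate(alive):
            for j in alive[a + 1:]:
                out = preference_oracle(returns[i], returns[j], w_star, rng)
                wins[i, j] += out
                wins[j, i] += 1 - out
                counts[i, j] += 1
                counts[j, i] += 1

        # Eliminate any policy that some rival confidently beats.
        eliminated = set()
        for i in alive:
            for j in alive:
                if i == j or counts[j, i] == 0:
                    continue
                p_hat = wins[j, i] / counts[j, i]  # rate at which j beats i
                radius = np.sqrt(np.log(2.0 / delta) / (2.0 * counts[j, i]))
                if p_hat - radius > 0.5:
                    eliminated.add(i)
                    break
        alive = [i for i in alive if i not in eliminated] or alive[:1]
        if len(alive) == 1:
            break
    return alive[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    returns = rng.uniform(size=(6, 3))   # 6 candidate policies, 3 objectives
    w_star = np.array([0.5, 0.3, 0.2])   # hidden scalarization, unknown to the learner
    best = select_policy(returns, w_star, rng)
    print("selected policy:", best, "utility:", returns[best] @ w_star)
```

Note that this sketch never estimates the reward vectors or `w_star` itself; it only compares policies, which mirrors the abstract's point that regret-optimal selection does not require recovering the latent reward representation.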
Submission Number: 122