Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Everyone | CC BY 4.0
Keywords: Multi-objective reinforcement learning, human-in-the-loop, preference learning
TL;DR: We propose PBMORL, a human-in-the-loop MORL framework that learns preferences from limited feedback to efficiently discover high-quality, preference-aligned policies.
Abstract: Multi-objective reinforcement learning (MORL) seeks policies that effectively balance conflicting objectives. However, presenting many diverse policies without accounting for the decision maker’s (DM’s) preferences can overwhelm the decision-making process. On the other hand, accurately specifying preferences in advance is often unrealistic. To address these challenges, we introduce a human-in-the-loop MORL framework that interactively discovers preferred policies during optimization. Our approach actively learns the DM’s implicit preferences in real time and requires no a priori knowledge. Importantly, we integrate this preference learning directly into a parallel optimization framework, balancing exploration and exploitation to identify high-quality policies aligned with the DM’s preferences. Evaluations on a complex quadrupedal robot simulation environment demonstrate that, with only a limited number of interactions, our proposed method can identify policies aligned with human preferences, e.g., running like a dog. Further experiments on seven MuJoCo tasks and a multi-microgrid system design task, against eight state-of-the-art MORL algorithms, further demonstrate the effectiveness of our proposed framework. Demonstrations and full experiments are available at https://sites.google.com/view/pbmorl/home.
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12302