Two-Step Offline Preference-Based Reinforcement Learning on Explicitly Constrained Policies

TMLR Paper5302 Authors

05 Jul 2025 (modified: 09 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Preference-based reinforcement learning (PBRL) in the offline setting has achieved great success in industrial applications such as chatbots. A two-step learning framework, which first learns a reward model from the dataset and then optimizes a policy over the learned reward model through reinforcement learning, has been widely adopted for this problem. However, such a method faces challenges from the risk of reward hacking and the complexity of reinforcement learning. To overcome these challenges, our key insight is that both stem from state-actions not supported in the dataset: the learned reward is unreliable on such state-actions, and they increase the complexity of the reinforcement learning problem. Based on this insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning on explicitly constrained policies. The high-level idea is to restrict the reinforcement learning agent to policies supported on an explicitly constrained action space that excludes out-of-distribution state-actions. We empirically verify that our method achieves high learning efficiency on various datasets in robotic control environments.
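To make the two-step framework concrete, the following is a minimal sketch (not the paper's actual implementation): step one fits a reward model on preference pairs with a Bradley-Terry loss, and step two queries that reward only on candidate actions drawn from the dataset support, which is one simple way to realize an explicitly constrained action space. All names (RewardModel, constrained_greedy_action) and the synthetic data are illustrative assumptions.

```python
# Hypothetical sketch of two-step offline PBRL with an explicit action constraint.
import torch
import torch.nn as nn

# --- Step 1: learn a reward model from preference pairs (Bradley-Terry loss) ---
class RewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Per-step reward for each (state, action) pair.
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

obs_dim, act_dim, n_pairs, horizon = 4, 2, 256, 10
reward_model = RewardModel(obs_dim, act_dim)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Synthetic preference data: segment A is preferred over segment B.
obs_a, act_a = torch.randn(n_pairs, horizon, obs_dim), torch.randn(n_pairs, horizon, act_dim)
obs_b, act_b = torch.randn(n_pairs, horizon, obs_dim), torch.randn(n_pairs, horizon, act_dim)

for _ in range(200):
    r_a = reward_model(obs_a, act_a).sum(dim=-1)        # return of preferred segment
    r_b = reward_model(obs_b, act_b).sum(dim=-1)        # return of less-preferred segment
    loss = -torch.log(torch.sigmoid(r_a - r_b)).mean()  # Bradley-Terry preference likelihood
    opt.zero_grad(); loss.backward(); opt.step()

# --- Step 2: policy improvement restricted to in-distribution actions ---
# The learned reward is evaluated only on candidate actions taken from the
# dataset support, so out-of-distribution state-actions are never queried.
def constrained_greedy_action(obs, candidate_actions):
    """Pick the candidate action with the highest learned reward."""
    obs_rep = obs.unsqueeze(0).expand(candidate_actions.shape[0], -1)
    with torch.no_grad():
        scores = reward_model(obs_rep, candidate_actions)
    return candidate_actions[scores.argmax()]

state = torch.randn(obs_dim)
dataset_actions_at_state = torch.randn(8, act_dim)  # stand-in for dataset-supported actions
best_action = constrained_greedy_action(state, dataset_actions_at_state)
```

In a full pipeline, the greedy selection in step two would typically be replaced by an offline RL algorithm whose policy class is restricted to the constrained action set; the sketch only illustrates how the constraint keeps the learned reward on in-distribution inputs.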
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sebastian_Tschiatschek1
Submission Number: 5302