Constrained Preference RLHF

ICLR 2026 Conference Submission 21545 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: reinforcement learning from human feedback, preference-based reinforcement learning, reinforcement learning with constraints
TL;DR: A dual-only approach to constrained reinforcement learning from human feedback
Abstract: We study offline constrained reinforcement learning from human feedback with multiple preference oracles. Motivated by applications that trade off performance against safety or fairness, we aim to maximize target-population utility subject to a minimum welfare constraint for a protected group. From pairwise comparisons collected under a reference policy, we estimate oracle-specific rewards via maximum likelihood and analyze how statistical uncertainty propagates through the dual program. We cast the constrained objective as a KL-regularized Lagrangian whose primal optimizer is a Gibbs policy, reducing learning to a one-dimensional convex dual problem. We propose a dual-only algorithm that ensures high-probability constraint satisfaction and provide finite-sample performance guarantees for the resulting Gibbs policy. Our analysis shows how estimation error, data coverage, and constraint slack jointly affect feasibility and optimality.
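A minimal sketch of the standard formulation the abstract describes may help make it concrete; the notation below (estimated rewards $\hat r_0, \hat r_1$, welfare threshold $\tau$, KL temperature $\beta$, reference policy $\pi_{\mathrm{ref}}$) is illustrative and not taken from the paper. The constrained objective is
$$
\max_{\pi}\ \mathbb{E}_{\pi}\big[\hat r_0(s,a)\big] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\big[\hat r_1(s,a)\big] \ge \tau,
$$
with each reward fitted from pairwise comparisons by maximum likelihood under a Bradley–Terry model, $\Pr(a \succ a' \mid s) = \sigma\big(\hat r(s,a) - \hat r(s,a')\big)$. The KL-regularized Lagrangian
$$
L(\pi,\lambda) \;=\; \mathbb{E}_{\pi}\big[\hat r_0 + \lambda \hat r_1\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\Vert\, \pi_{\mathrm{ref}}\big) \;-\; \lambda\tau
$$
is maximized over $\pi$ in closed form by the Gibbs policy
$$
\pi_{\lambda}(a \mid s) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\,\exp\!\Big(\tfrac{\hat r_0(s,a) + \lambda \hat r_1(s,a)}{\beta}\Big),
$$
so learning reduces to the one-dimensional convex dual problem $\min_{\lambda \ge 0} \max_{\pi} L(\pi,\lambda)$ over the single multiplier $\lambda$, consistent with the dual-only algorithm described above.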
Primary Area: reinforcement learning
Submission Number: 21545