A Constrained Bi-level Optimization Framework for Constrained Reinforcement Learning from Human Feedback
Keywords: Constrained Reinforcement Learning from Human Feedback
Abstract: This paper studies the problem of jointly learning a reward function, a cost function, and a policy from human feedback. We formulate the problem as a constrained bi-level optimization in which the upper level infers the reward and cost functions from feedback, while the lower level optimizes a policy under the learned reward subject to the learned cost constraint. To solve this problem, we propose a double-loop algorithm, Constrained Bi-level Optimization for Reinforcement Learning from Human Feedback (CB-RLHF), which solves the lower-level optimization problem in the inner loop and the upper-level optimization problem in the outer loop. We establish a theoretical guarantee that CB-RLHF converges at a rate of $\mathcal{O}(\frac{1}{\sqrt{K}})$, and we demonstrate its empirical effectiveness across multiple simulation environments.
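The double-loop structure described in the abstract can be illustrated with a minimal toy sketch: an outer loop that fits reward and cost models from simulated feedback, and an inner loop that solves the lower-level policy problem under the current models via a Lagrangian. The bandit setting, the Bradley-Terry preference model, the least-squares cost fit, and all variable names here are illustrative assumptions, not the paper's actual CB-RLHF implementation.

```python
import numpy as np

# Toy sketch of a double-loop constrained bi-level RLHF scheme
# (illustrative assumptions throughout; not the paper's algorithm).
rng = np.random.default_rng(0)

n_actions, d = 5, 3
phi = rng.normal(size=(n_actions, d))      # action features
w_true = np.array([1.0, -0.5, 0.3])        # ground-truth reward weights
v_true = np.array([0.2, 0.8, -0.1])        # ground-truth cost weights
true_r, true_c = phi @ w_true, phi @ v_true
budget = float(np.mean(true_c))            # constraint: E_pi[cost] <= budget

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Simulated human feedback: Bradley-Terry preferences over action pairs,
# plus noisy scalar cost annotations.
pairs, cost_obs = [], []
for _ in range(300):
    i, j = rng.choice(n_actions, size=2, replace=False)
    p = 1.0 / (1.0 + np.exp(-(true_r[i] - true_r[j])))
    pairs.append((i, j) if rng.random() < p else (j, i))
    a = int(rng.integers(n_actions))
    cost_obs.append((a, true_c[a] + 0.1 * rng.normal()))

w = np.zeros(d)               # learned reward parameters (upper level)
v = np.zeros(d)               # learned cost parameters (upper level)
theta = np.zeros(n_actions)   # policy logits (lower level)
lam = 0.0                     # Lagrange multiplier for the cost constraint

for _ in range(100):                        # outer loop: upper-level updates
    r_hat, c_hat = phi @ w, phi @ v
    for _ in range(25):                     # inner loop: lower-level policy
        pi = softmax(theta)
        lagr = r_hat - lam * c_hat
        theta += 0.5 * pi * (lagr - pi @ lagr)   # softmax policy gradient
    pi = softmax(theta)
    lam = max(0.0, lam + 0.1 * (pi @ c_hat - budget))  # dual ascent
    # Upper level: Bradley-Terry negative log-likelihood gradient (reward).
    g = np.zeros(d)
    for i, j in pairs:
        p = 1.0 / (1.0 + np.exp(-((phi[i] - phi[j]) @ w)))
        g += (p - 1.0) * (phi[i] - phi[j])
    w -= 0.05 * g / len(pairs)
    # Upper level: least-squares gradient for the cost model.
    h = np.zeros(d)
    for a, y in cost_obs:
        h += (phi[a] @ v - y) * phi[a]
    v -= 0.05 * h / len(cost_obs)

pi = softmax(theta)
```

In this sketch the inner loop plays the role of the lower-level policy optimization and the outer loop that of the upper-level reward/cost inference, mirroring the double-loop structure the abstract attributes to CB-RLHF.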
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15686