A Constrained Bi-level Optimization Framework for Constrained Reinforcement Learning from Human Feedback
Keywords: Constrained Reinforcement Learning from Human Feedback
Abstract: This paper studies the problem of jointly learning a reward function, a cost function, and a policy from human feedback. We formulate the problem as a constrained bi-level optimization in which the upper level infers the reward and cost functions from feedback, while the lower level optimizes a policy under the learned reward subject to the learned cost constraint. To solve this problem, we propose a double-loop algorithm, Constrained Bi-level Optimization for Reinforcement Learning from Human Feedback (CB-RLHF), which solves the lower-level optimization problem in the inner loop and the upper-level optimization problem in the outer loop. We establish a theoretical guarantee that CB-RLHF converges at a rate of $\mathcal{O}(\frac{1}{\sqrt{K}})$, and we demonstrate its empirical effectiveness across multiple simulation environments.
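The double-loop structure described in the abstract can be illustrated with a minimal toy sketch: an outer loop that fits reward and cost models from simulated feedback, and an inner loop that solves the lower-level policy problem under the current models via a Lagrangian. The bandit setting, the Bradley-Terry preference model, the least-squares cost fit, and all variable names here are illustrative assumptions, not the paper's actual CB-RLHF implementation.

```python
import numpy as np

# Toy sketch of a double-loop constrained bi-level RLHF scheme
# (illustrative assumptions throughout; not the paper's algorithm).
rng = np.random.default_rng(0)

n_actions, d = 5, 3
phi = rng.normal(size=(n_actions, d))      # action features
w_true = np.array([1.0, -0.5, 0.3])        # ground-truth reward weights
v_true = np.array([0.2, 0.8, -0.1])        # ground-truth cost weights
true_r, true_c = phi @ w_true, phi @ v_true
budget = float(np.mean(true_c))            # constraint: E_pi[cost] <= budget

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Simulated human feedback: Bradley-Terry preferences over action pairs,
# plus noisy scalar cost annotations.
pairs, cost_obs = [], []
for _ in range(300):
    i, j = rng.choice(n_actions, size=2, replace=False)
    p = 1.0 / (1.0 + np.exp(-(true_r[i] - true_r[j])))
    pairs.append((i, j) if rng.random() < p else (j, i))
    a = int(rng.integers(n_actions))
    cost_obs.append((a, true_c[a] + 0.1 * rng.normal()))

w = np.zeros(d)               # learned reward parameters (upper level)
v = np.zeros(d)               # learned cost parameters (upper level)
theta = np.zeros(n_actions)   # policy logits (lower level)
lam = 0.0                     # Lagrange multiplier for the cost constraint

for _ in range(100):                        # outer loop: upper-level updates
    r_hat, c_hat = phi @ w, phi @ v
    for _ in range(25):                     # inner loop: lower-level policy
        pi = softmax(theta)
        lagr = r_hat - lam * c_hat
        theta += 0.5 * pi * (lagr - pi @ lagr)   # softmax policy gradient
    pi = softmax(theta)
    lam = max(0.0, lam + 0.1 * (pi @ c_hat - budget))  # dual ascent
    # Upper level: Bradley-Terry negative log-likelihood gradient (reward).
    g = np.zeros(d)
    for i, j in pairs:
        p = 1.0 / (1.0 + np.exp(-((phi[i] - phi[j]) @ w)))
        g += (p - 1.0) * (phi[i] - phi[j])
    w -= 0.05 * g / len(pairs)
    # Upper level: least-squares gradient for the cost model.
    h = np.zeros(d)
    for a, y in cost_obs:
        h += (phi[a] @ v - y) * phi[a]
    v -= 0.05 * h / len(cost_obs)

pi = softmax(theta)
```

In this sketch the inner loop plays the role of the lower-level policy optimization and the outer loop that of the upper-level reward/cost inference, mirroring the double-loop structure the abstract attributes to CB-RLHF.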
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15686