RLHF with Inconsistent Multi-Agent Feedback Under General Function Approximation: A Theoretical Perspective
Keywords: RLHF theory, inconsistent multi-agent feedback, regret analysis
Abstract: Reinforcement learning from human feedback (RLHF) has been widely studied, as a method for leveraging feedback from human evaluators to guide the learning process. However, existing theoretical analyses typically assume that the human feedback is generated by the ground-truth reward function. This may not be true in practice, because the reward functions in human minds for providing feedback are usually different from the ground-truth reward function, e.g., due to diverse personal experiences and inherent biases. Such inconsistencies could lead to undesirable outcomes when applying existing algorithms, particularly when considering feedback from heterogeneous agents. Therefore, in this paper, we make the first effort to investigate a more practical and general setting of RLHF, where feedback could be generated by multiple agents with reward functions differing from the ground truth. To address this challenge, we develop a new algorithm with novel ideas for handling inconsistent multi-agent feedback, including a Steiner-Point-based confidence set to exploit the benefits of *multi-agent* feedback and a new weighted importance sampling method to manage complexity issues arising from *inconsistency*. Our theoretical analysis develops new methods to demonstrate the optimality of our algorithm. This result is the first of its kind to demonstrate the fundamental impact and potential of inconsistent multi-agent feedback in RLHF.
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12291
Loading