Mechanism Design for Alignment via Human Feedback

Published: 10 Jun 2025, Last Modified: 30 Jun 2025 · MoFA Poster · CC BY 4.0
Keywords: Machine Learning, Human Feedback, Mechanism Design
TL;DR: We propose a novel mechanism framework to incentivize effort and honesty in preference elicitation for RLHF
Abstract: Ensuring the faithfulness of human feedback is crucial for effectively aligning large language models (LLMs) using reinforcement learning from human feedback (RLHF), as low-effort or dishonest reporting can significantly undermine the quality of this feedback and, consequently, the alignment process. We address the challenge of eliciting faithful pairwise feedback by framing it as a mechanism design problem. We introduce a new principal-agent model for preference elicitation that incorporates both effort and truthfulness as key aspects of annotator strategies and mirrors the assumptions made in reward modeling for RLHF. We then define three incentive compatibility properties that a desirable mechanism framework should satisfy: Uninformed Equilibrium Incompatibility, $\omega$-Bayes-Nash Incentive Compatibility, and Effort Competitiveness. We propose a novel mechanism framework called Acyclic Peer Agreement (APA), which we aim to prove satisfies all three incentive compatibility properties. We conclude by discussing next steps and outlining future research directions in the design of robust mechanisms for preference elicitation.
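For context, the sketch below shows a naive peer-agreement payment rule for pairwise preference reports, the kind of baseline that incentive properties like those above are meant to improve upon. It is a minimal illustration under assumed interfaces: the `peer_agreement_payments` function, its report format, and the flat `bonus` are hypothetical, and this is not the paper's APA mechanism, whose construction is not described here.

```python
import random

def peer_agreement_payments(reports, bonus=1.0, rng=None):
    """Toy output-agreement payments for pairwise preference reports.

    reports: dict mapping annotator id -> dict mapping pair id -> 0/1 report.
    Each annotator is paid `bonus` for every pair on which their report
    matches that of a uniformly random peer who labeled the same pair.
    (Illustrative baseline only; not the APA mechanism from the paper.)
    """
    rng = rng or random.Random(0)
    payments = {a: 0.0 for a in reports}
    for a, labels in reports.items():
        for pair, r in labels.items():
            peers = [b for b in reports if b != a and pair in reports[b]]
            if not peers:
                continue
            peer = rng.choice(peers)
            if reports[peer][pair] == r:
                payments[a] += bonus
    return payments

# Example: two annotators agree on pair "p1" and disagree on "p2".
reports = {
    "ann1": {"p1": 1, "p2": 0},
    "ann2": {"p1": 1, "p2": 1},
}
print(peer_agreement_payments(reports))  # {'ann1': 1.0, 'ann2': 1.0}
```

Note that such naive agreement payments admit uninformed equilibria (e.g., every annotator reporting the same label without examining the responses), which is the failure mode that the Uninformed Equilibrium Incompatibility property appears designed to rule out.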
Submission Number: 71