Keywords: Pitfalls; Human Proxies; Alignment
Abstract: The demand for regulating the behavior of large language models (LLMs) has ignited research on alignment algorithms, the essence of which is to align LLMs' generations with human preferences. Because it is infeasible for humans to participate directly in the training or generation of LLMs, existing alignment algorithms align with human preferences carried by proxies, i.e., preference data or reward models. However, whether these human proxies faithfully represent human preferences remains under-explored. We categorize human proxies into two levels based on the degree to which they directly embody human preferences: Level-1 Proxy (preference data) and Level-2 Proxy (reward models). We empirically examine the faithfulness of both levels of proxies and their impact on alignment performance.
We observe that current algorithms tend to overlook the faithfulness of these proxies in reflecting human preferences; many works even use reward models directly as automatic evaluators without any correlation verification. The current alignment literature focuses heavily on optimizing algorithms, rendering the faithfulness of human proxies an "elephant in the room": something extremely important yet largely overlooked. Based on our experimental results, we unveil the potential risks of using inferior "human proxies", aiming to draw attention to this huge "elephant" in alignment research. We summarize existing pitfalls from different angles and provide a re-labeled preference dataset and insights about reward model usage to facilitate the healthy development of alignment. (This work contains examples that may implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.)
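To make the "correlation verification" the abstract calls for concrete, below is a minimal sketch of the simplest such check: measuring how often a reward model's pairwise judgments agree with human preference labels before trusting it as an automatic evaluator. This is an illustrative example, not the paper's method; the function name and toy scores are hypothetical.

```python
# Hedged sketch: verify a reward model against human preference labels by
# computing pairwise agreement on a human-labeled preference set.
# All names and data below are hypothetical stand-ins, not from the paper.

from typing import List

def pairwise_agreement(chosen_scores: List[float],
                       rejected_scores: List[float]) -> float:
    """Fraction of pairs where the reward model scores the human-chosen
    response above the human-rejected one (ties count as disagreement)."""
    assert len(chosen_scores) == len(rejected_scores) and chosen_scores
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return wins / len(chosen_scores)

if __name__ == "__main__":
    # Toy stand-ins for reward-model scores on human-labeled response pairs.
    chosen = [2.1, 0.4, 1.7, -0.2, 3.0]    # scores for human-preferred responses
    rejected = [1.3, 0.9, 0.5, 0.1, 2.2]   # scores for human-rejected responses
    print(f"Agreement with human labels: {pairwise_agreement(chosen, rejected):.0%}")
```

A low agreement rate on held-out human labels would suggest the reward model is an unfaithful Level-2 proxy and should not be used as an evaluator without further scrutiny.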
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4655