Keywords: reinforcement learning, alignment, human feedback, rlhf, AI safety
TL;DR: A novel theoretical framework that characterizes the limitations of approaches that only learn from human feedback
Abstract: Ensuring the alignment of artificial intelligence (AI) systems with human objectives is a critical challenge in the development of safe and effective AI technologies. Reinforcement learning from human feedback (RLHF) has been a predominant method for tackling this challenge. However, this framework operates under the unrealistic assumptions that human preferences accurately reflect human desires and that they remain constant over time. This paper identifies and challenges these assumptions by illustrating how they can lead to undesirable consequences, particularly when human beliefs about the environment are incorrect or change over time. To address these challenges, we introduce a novel framework termed practical alignment. This framework redefines the alignment objective to accommodate the variability and irrationality of human beliefs, emphasizing the need for AI systems not only to learn from but also to teach humans about the world. We discuss the theoretical underpinnings of practical alignment and introduce MindGrid, a toolkit designed to simulate and evaluate alignment scenarios. Our experimental results using large language models in teaching scenarios underscore the importance of teaching skills as a requisite capability for achieving alignment.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11867