Keywords: Reinforcement Learning from Human Feedback (RLHF), Human-AI Alignment, Large Language Models, AI Safety, Partial Observability
TL;DR: We propose RLHS, a method that incorporates hindsight feedback to mitigate misalignment in Reinforcement Learning from Human Feedback.
Abstract: Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on *immediate* feedback, which can fail to reflect the true downstream impact of an interaction on users' utility. We demonstrate that this shortsighted feedback can, by itself, result in misaligned behaviors like sycophancy and deception, and we propose to alleviate this by refocusing RLHF on *downstream consequences*. Our theoretical analysis reveals that the hindsight gained by simply delaying human feedback mitigates misalignment and improves expected human utility. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely used online and offline preference optimization methods---Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO)---and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.
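To make the hindsight-simulation idea concrete, here is a minimal sketch (not the authors' code) of how hindsight-based preference pairs might be assembled for DPO-style training. All function names (`generate_responses`, `simulate_outcome`, `hindsight_rating`) are hypothetical stand-ins for the policy model, a consequence simulator, and a feedback elicitor, respectively; the details below are assumptions for illustration only.

```python
# Hypothetical sketch of RLHS-style preference-pair construction.
# The key difference from standard RLHF: feedback is elicited only
# *after* simulating the downstream outcome of each candidate response.

from dataclasses import dataclass
import random


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred in hindsight
    rejected: str  # response dispreferred in hindsight


def generate_responses(prompt: str, n: int = 2) -> list[str]:
    # Stand-in for sampling n candidate responses from the current policy.
    return [f"candidate response {i} to: {prompt}" for i in range(n)]


def simulate_outcome(prompt: str, response: str) -> str:
    # Stand-in for rolling out the downstream consequences of the interaction
    # (e.g., the simulated user acting on the model's advice).
    return f"simulated outcome after following: {response!r}"


def hindsight_rating(prompt: str, response: str, outcome: str) -> float:
    # Stand-in for eliciting feedback *after* the outcome is revealed,
    # so the rating reflects realized utility, not immediate impressions.
    return random.random()


def build_hindsight_pair(prompt: str) -> PreferencePair:
    scored = []
    for response in generate_responses(prompt, n=2):
        outcome = simulate_outcome(prompt, response)
        scored.append((hindsight_rating(prompt, response, outcome), response))
    scored.sort(reverse=True)
    return PreferencePair(prompt=prompt, chosen=scored[0][1], rejected=scored[1][1])


if __name__ == "__main__":
    pair = build_hindsight_pair("Which laptop should I buy for video editing?")
    print(pair)
```

Pairs built this way could then be fed to an off-the-shelf DPO trainer, or the hindsight ratings could serve as rewards for PPO; the sketch only illustrates the ordering of simulation before feedback, not the authors' actual pipeline.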
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8740