Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm for aligning large language models (LLMs) with human preferences. Yet existing RLHF relies heavily on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g., human labeling errors, making the pipeline fragile. In this work, we formulate the problem of performing robust RLHF with noisy reward models. Our goal is to design robust RLHF algorithms that explicitly acknowledge the potential noise in a reward model. Our first contribution is an analysis revealing that a certain transformation of the preference function improves its robustness to noise in the reward function. This observation leads to a new reward function design that involves two steps: (1) an offline sampling step to obtain responses to prompts that serve as baselines for reward calculation, and (2) a contrastive reward, computed against these baseline responses, used in Proximal Policy Optimization (PPO). We show that the proposed rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over the baselines, calibrate according to task difficulty, and reduce variance in PPO. We also empirically demonstrate that the contrastive reward improves RLHF substantially, as evaluated by both GPT models and humans, and that it consistently outperforms strong baselines.
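The abstract's two-step reward design can be sketched in a few lines: score offline baseline responses per prompt, then subtract the per-prompt baseline from the reward-model score inside PPO. The sketch below is illustrative only; the function and variable names (`reward_fn`, `baseline_responses`, etc.) are assumptions, not the paper's implementation.

```python
# Minimal sketch of a contrastive reward, assuming a scalar reward model
# reward_fn(prompt, response) and pre-sampled baseline responses per prompt.
from typing import Callable, Dict, List


def build_baselines(
    prompts: List[str],
    baseline_responses: Dict[str, List[str]],
    reward_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Step 1 (offline): score the sampled baseline responses for each prompt
    and average their rewards to obtain a per-prompt baseline value."""
    baselines: Dict[str, float] = {}
    for prompt in prompts:
        scores = [reward_fn(prompt, resp) for resp in baseline_responses[prompt]]
        baselines[prompt] = sum(scores) / len(scores)
    return baselines


def contrastive_reward(
    prompt: str,
    response: str,
    reward_fn: Callable[[str, str], float],
    baselines: Dict[str, float],
) -> float:
    """Step 2 (online, inside PPO): reward the policy only for improvement
    over the offline baseline computed for the same prompt."""
    return reward_fn(prompt, response) - baselines[prompt]
```

Centering the reward against a per-prompt baseline is what lets the training signal reflect improvement over the baseline responses and adapt to task difficulty, rather than rewarding absolute (and possibly noisy) reward-model scores.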
Paper Type: Long
Research Area: Generation
Research Area Keywords: LLM Alignment, RLHF
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1946