Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm for aligning large language models (LLMs) with human preferences. Yet existing RLHF relies heavily on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g., human labeling errors, making the pipeline fragile. In this work, we formulate the problem of performing robust RLHF with noisy reward models. Our goal is to design robust RLHF algorithms that explicitly acknowledge the potential noise in a reward model. Our first contribution is an analysis revealing that a certain transformation of the preference function improves its robustness to noise in the reward function. This observation leads to a new reward function design that involves two steps: (1) an offline sampling step to obtain responses to prompts that serve as baselines for reward calculation, and (2) a contrastive reward, computed against these baseline responses, used in Proximal Policy Optimization (PPO). We show that the proposed rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over the baselines, calibrate according to task difficulty, and reduce variance in PPO. We also empirically demonstrate that the contrastive reward improves RLHF substantially, as evaluated by both GPT models and humans, and that it consistently outperforms strong baselines.
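The abstract's two-step reward design can be sketched in a few lines: score offline baseline responses per prompt, then subtract the per-prompt baseline from the reward-model score inside PPO. The sketch below is illustrative only; the function and variable names (`reward_fn`, `baseline_responses`, etc.) are assumptions, not the paper's implementation.

```python
# Minimal sketch of a contrastive reward, assuming a scalar reward model
# reward_fn(prompt, response) and pre-sampled baseline responses per prompt.
from typing import Callable, Dict, List


def build_baselines(
    prompts: List[str],
    baseline_responses: Dict[str, List[str]],
    reward_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Step 1 (offline): score the sampled baseline responses for each prompt
    and average their rewards to obtain a per-prompt baseline value."""
    baselines: Dict[str, float] = {}
    for prompt in prompts:
        scores = [reward_fn(prompt, resp) for resp in baseline_responses[prompt]]
        baselines[prompt] = sum(scores) / len(scores)
    return baselines


def contrastive_reward(
    prompt: str,
    response: str,
    reward_fn: Callable[[str, str], float],
    baselines: Dict[str, float],
) -> float:
    """Step 2 (online, inside PPO): reward the policy only for improvement
    over the offline baseline computed for the same prompt."""
    return reward_fn(prompt, response) - baselines[prompt]
```

Centering the reward against a per-prompt baseline is what lets the training signal reflect improvement over the baseline responses and adapt to task difficulty, rather than rewarding absolute (and possibly noisy) reward-model scores.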
Paper Type: Long
Research Area: Generation
Research Area Keywords: LLM Alignment, RLHF
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1946