Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration

ACL ARR 2024 April Submission 676 Authors

16 Apr 2024 (modified: 07 Jun 2024), ACL ARR 2024 April Submission, CC BY 4.0
Abstract: While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns regarding the complexity and instability of the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based alignment methods as viable alternatives. This paper examines existing order-based methods, unifying them under a common framework and analyzing their inefficiency in utilizing reward values. Building on these findings, we propose a new Value-based Calibration (VCB) method to better align LLMs with human preferences. Experimental results demonstrate that VCB surpasses existing alignment methods on AI assistant and summarization datasets, exhibiting strong generalizability, robustness, and diversity across settings.
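To make the abstract's central distinction concrete, the minimal sketch below contrasts an order-based preference loss (which uses only which of two responses is preferred) with a hypothetical reward-value-weighted variant (which also uses the reward magnitudes). This is an illustrative assumption, not the paper's VCB objective; all function names and the specific weighting scheme are invented for exposition.

```python
# Illustrative sketch only: NOT the paper's VCB loss.
# Shows how an order-based objective discards reward magnitudes,
# while a value-aware variant can use the reward gap as extra signal.
import torch
import torch.nn.functional as F


def order_based_loss(logp_c, logp_r, logp_ref_c, logp_ref_r, beta=0.1):
    """DPO-style loss: depends only on the preference ordering of the pair."""
    margin = beta * ((logp_c - logp_ref_c) - (logp_r - logp_ref_r))
    return -F.logsigmoid(margin).mean()


def reward_weighted_loss(logp_c, logp_r, logp_ref_c, logp_ref_r,
                         r_c, r_r, beta=0.1):
    """Hypothetical value-aware variant: scales the margin by the reward gap,
    so pairs with larger reward differences exert a stronger pull."""
    margin = beta * ((logp_c - logp_ref_c) - (logp_r - logp_ref_r))
    reward_gap = (r_c - r_r).clamp(min=0.0)  # magnitude info an order-based loss ignores
    return -F.logsigmoid(margin * reward_gap).mean()


# Toy usage with random policy/reference log-probabilities and scalar rewards.
torch.manual_seed(0)
lp_c, lp_r = torch.randn(4), torch.randn(4)
lp_ref_c, lp_ref_r = torch.randn(4), torch.randn(4)
r_c, r_r = torch.tensor([2.0, 1.5, 0.9, 3.0]), torch.tensor([0.5, 1.0, 0.8, -1.0])
print(order_based_loss(lp_c, lp_r, lp_ref_c, lp_ref_r))
print(reward_weighted_loss(lp_c, lp_r, lp_ref_c, lp_ref_r, r_c, r_r))
```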
Paper Type: Long
Research Area: Generation
Research Area Keywords: Language Modeling, Summarization, Question Answering
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 676