LIRE: Listwise Reward Enhancement for Preference Alignment

16 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: LLM, RLHF, Preference alignment
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Recently, tremendous strides have been made in Natural Language Generation (NLG) thanks to advances in Large Language Models (LLMs). However, because they are often trained on large-scale unsupervised data, LLMs can generate toxic or unhelpful content in the absence of human supervision. Reinforcement learning from human feedback (RLHF) has proven to be an effective remedy for this problem and has become prevalent among researchers. However, RLHF is notoriously unstable and hyperparameter-sensitive, which hinders building an all-encompassing and sustainable LLM system. For this reason, we propose a new approach, LIRE (Listwise Reward Enhancement for Preference Alignment), which optimizes rewards through a listwise paradigm. We directly incorporate the rewards of multiple candidates into a listwise loss and optimize against it in a compact and effective framework, without explicitly relying on the Bradley-Terry model. Furthermore, we propose a self-enhancement algorithm that progressively improves the reward through iterative training. We also conduct extensive experiments demonstrating the stability and consistency of the model's performance without heavy hyperparameter tuning, while still surpassing state-of-the-art methods on preference alignment tasks.
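
To make the listwise idea in the abstract concrete, below is a minimal PyTorch sketch of one plausible instantiation: the policy's probability mass is normalized over the K sampled candidates for a prompt, and the negative expected reward under that listwise distribution is used as the loss. The function name, temperature parameter, and exact normalization are illustrative assumptions, not the paper's precise objective.

```python
# Hypothetical sketch of a listwise reward-weighted objective; the exact LIRE
# loss, temperature, and normalization are assumptions for illustration only.
import torch
import torch.nn.functional as F

def listwise_reward_loss(candidate_logprobs: torch.Tensor,
                         rewards: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """
    candidate_logprobs: (batch, K) summed token log-probabilities of K sampled responses
    rewards:            (batch, K) scalar rewards for the same responses from a reward model
    Returns a scalar loss whose minimization shifts the policy's relative likelihood
    toward higher-reward candidates within each list.
    """
    # Normalize the policy's probability mass over the K candidates in the list.
    policy_dist = F.softmax(candidate_logprobs / temperature, dim=-1)  # (batch, K)
    # Expected reward under the listwise distribution; negate to obtain a loss.
    expected_reward = (policy_dist * rewards).sum(dim=-1)              # (batch,)
    return -expected_reward.mean()
```

In such a formulation the gradient pushes up the likelihood of candidates with above-average reward and pushes down the rest, without requiring pairwise (Bradley-Terry style) preference modeling.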
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 558