Controlling language over-optimization by targeting reward distribution

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Models, Reinforcement Learning, fine-tuning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper provides a method that aligns the reward distribution associated with sentences generated by a fine-tuned LLM with a predefined target reward distribution.
Abstract: Reinforcement Learning (RL) has become a key optimization tool for fine-tuning and aligning Large Language Models (LLMs) with human preferences. However, this approach relies on reward models susceptible to reward over-optimization, wherein language models learn to hack the reward function, resulting in unnatural generations. In this paper, we address this issue by aligning the reward distribution of sentences generated by the fine-tuned model with a predefined target reward distribution. This offers a priori, parameter-free control over the distribution of rewards obtained by the model, setting it apart from other regularization and post-processing techniques. Our experiments show that this RL approach alleviates several optimization challenges in LLMs: it reduces log-likelihood error accumulation when generating lengthy sequences, mitigates reward hacking when generating positive reviews on IMDB, and upholds length constraints while aligning summaries with human preferences on the TL;DR dataset. Our findings highlight that targeting reward distributions is a promising strategy to better control and enhance the reliability of RL-based fine-tuning.
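The abstract does not specify how the generated rewards are aligned with the target distribution. A minimal sketch, assuming one plausible realization via quantile (rank) matching of batch rewards onto a chosen target distribution, is shown below; the function name, the Gaussian target, and the batch-level reshaping are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np
from scipy import stats


def reshape_rewards_to_target(rewards, target_dist=stats.norm(loc=0.0, scale=1.0)):
    """Hypothetical sketch: map raw reward-model scores onto a predefined
    target reward distribution via quantile (rank) matching.

    Each reward's empirical rank within the batch is converted to a quantile
    of the target distribution, so the reshaped rewards follow the target
    regardless of how extreme the raw scores become.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    n = len(rewards)
    # Empirical quantile of each reward within the batch, kept strictly in (0, 1).
    ranks = stats.rankdata(rewards, method="average")
    quantiles = ranks / (n + 1)
    # Inverse CDF of the target distribution gives the reshaped reward.
    return target_dist.ppf(quantiles)


# Example: an extreme ("hacked") raw score is pulled back toward the target
# distribution, limiting the incentive to over-optimize the reward model.
raw_scores = [0.1, 0.4, 0.5, 0.9, 12.0]
print(reshape_rewards_to_target(raw_scores))
```

Under this reading, the reshaped rewards would replace the raw reward-model scores in the RL objective (e.g., PPO), which caps the payoff of reward hacking because no generation can earn more than the upper quantiles of the target distribution.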
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6349