DfPO: Degeneration-free Policy Optimization via Action Masking in Natural Language Action Spaces

24 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Reinforcement learning, Natural language processing
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Because the pre-training objectives of language models (LMs), such as next-token prediction, are not inherently aligned with downstream task scores, LMs must be further optimized to achieve higher scores on downstream tasks. One promising approach is to fine-tune LMs with reinforcement learning (RL). However, conventional RL methods based on PPO with a KL-divergence penalty are vulnerable to the text degeneration problem, in which LMs no longer generate natural texts after RL fine-tuning. To address this problem, we propose Degeneration-free Policy Optimization (DfPO), which fine-tunes LMs to generate texts that achieve improved downstream task scores while preserving the naturalness of the generated texts. To achieve this, we introduce an action-masked policy, with which the behavior policy avoids selecting tokens that could derail policy optimization. We then devise clipped advantage functions to separately perform likelihood maximization and minimization, conditioned on texts sampled from the action-masked policy. Our experiments on the GRUE benchmark demonstrate that DfPO successfully improves downstream task scores while preserving the naturalness of the generated texts. Moreover, even without a hyperparameter search, DfPO outperforms PPO and NLPO, which require an additional search over the KL-divergence penalty ratio.
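To make the two ingredients of the abstract concrete, below is a minimal, hypothetical PyTorch sketch: (1) sampling from an action-masked policy and (2) a clipped-advantage loss that separates likelihood maximization from minimization. The abstract does not specify how the mask is constructed, so the sketch assumes an NLPO-style top-p mask under the frozen pre-trained reference LM; all names (masked_sample, dfpo_loss, top_p) are illustrative, not the authors' API.

```python
# Hypothetical sketch of the two ideas named in the abstract. The masking
# criterion (top-p under the reference LM) is an assumption, not the paper's
# stated method.
import torch
import torch.nn.functional as F

def masked_sample(policy_logits, ref_logits, top_p=0.9):
    """Sample a token from the policy, restricted to tokens that the frozen
    reference LM itself considers plausible (assumed masking criterion)."""
    ref_probs = F.softmax(ref_logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(ref_probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest token set covering top_p of the reference mass
    # (the top-1 token is always kept, so the distribution stays valid).
    keep_sorted = cumulative - sorted_probs < top_p
    keep = torch.zeros_like(ref_probs, dtype=torch.bool).scatter(
        -1, sorted_idx, keep_sorted)
    masked_logits = policy_logits.masked_fill(~keep, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits).sample()

def dfpo_loss(logp, advantages):
    """Clipped advantages: the positive part drives likelihood maximization,
    the negative part drives likelihood minimization; the paper applies the
    two objectives separately on texts sampled from the action-masked policy."""
    adv = advantages.detach()
    pos = torch.clamp(adv, min=0.0)   # tokens that improved the task score
    neg = torch.clamp(adv, max=0.0)   # tokens that hurt the task score
    maximize = -(pos * logp).mean()   # minimizing this pushes log-likelihood up
    minimize = -(neg * logp).mean()   # minimizing this pushes log-likelihood down
    return maximize + minimize
```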
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9127