Aligning Language Models Using Multi-Objective Deep Reinforcement Learning

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: The alignment techniques used in state-of-the-art language models (LMs), e.g., reinforcement learning from human feedback (RLHF), have driven many successful Natural Language Processing (NLP) tasks. RLHF uses human preferences, collected under the guideline of being helpful and safe, as a single reward signal to fine-tune language models. However, the trade-off between helpfulness and safety often proves problematic, making it difficult for a model trained toward one objective to perform well on both. In this paper, we propose a new alignment technique, named multi-objective language model alignment (MOLMA). The framework is based on multi-objective deep reinforcement learning for fine-tuning language models. MOLMA efficiently addresses the issue of conflicting or dominating learning signals, which arises from the trade-offs among the inherent, often conflicting objectives underlying the language model alignment task. With respect to the overall goal of achieving both helpfulness and safety, our results show that MOLMA outperforms alignment techniques that rely on single-objective deep reinforcement learning.
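The abstract does not detail how MOLMA combines its objectives. As a purely illustrative aside, the sketch below shows one common multi-objective baseline: a normalized weighted scalarization of separate helpfulness and safety reward signals, so that neither objective's scale dominates the policy update. All function names, weights, and the scalarization scheme are assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch of multi-objective reward scalarization for RLHF-style
# fine-tuning. The names (scalarize_rewards, weights) and the weighting scheme
# are illustrative assumptions, not MOLMA's actual interface or algorithm.
import numpy as np

def scalarize_rewards(helpfulness: np.ndarray,
                      safety: np.ndarray,
                      weights: tuple = (0.5, 0.5),
                      eps: float = 1e-8) -> np.ndarray:
    """Combine per-sample helpfulness and safety rewards into a single scalar
    signal, standardizing each objective so neither dominates the update."""
    # Standardize each objective over the batch so their scales are comparable.
    h = (helpfulness - helpfulness.mean()) / (helpfulness.std() + eps)
    s = (safety - safety.mean()) / (safety.std() + eps)
    w_h, w_s = weights
    return w_h * h + w_s * s

# Example: a batch of 4 responses scored by two separate reward models.
helpfulness = np.array([2.1, 0.3, 1.7, -0.5])
safety = np.array([0.9, 1.2, -0.4, 1.0])
combined = scalarize_rewards(helpfulness, safety)
print(combined)  # one scalar reward per response, fed to the RL update
```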
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings-efficiency, Theory
Languages Studied: English