Abstract: The alignment techniques used in state-of-the-art language models (LMs), e.g., reinforcement learning from human feedback (RLHF), have driven success across many natural language processing (NLP) tasks. RLHF fine-tunes language models using human preferences, collected under guidelines of helpfulness and safety, as a *single* reward signal. However, helpfulness and safety often trade off against each other, making it difficult for a model trained toward one objective to perform well on both. This paper proposes a new alignment technique, multi-objective language model alignment (MOLMA), a framework that fine-tunes language models with *multi*-objective deep reinforcement learning. MOLMA efficiently addresses the conflicting or dominating learning signals caused by the inherent, often competing, objectives underlying language model alignment. Measured against the overall goal of being both helpful and safe, our results show that MOLMA outperforms alignment techniques that rely on single-objective deep reinforcement learning.
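To make the "dominating learning signal" issue concrete, below is a minimal, hypothetical sketch (not the paper's MOLMA algorithm) contrasting a single scalar reward with a simple multi-objective treatment in which helpfulness and safety scores are standardized separately before being combined; all array values and the `standardize` helper are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-response scores from two separate reward models
# (illustrative values only; not from the paper).
helpfulness = np.array([0.9, 0.2, 0.7])
safety = np.array([0.1, 0.95, 0.6])

# Single-objective RLHF: one scalar reward. If one objective has a larger
# scale or variance, it tends to dominate the policy-gradient signal.
single_reward = helpfulness + safety

def standardize(x):
    # Zero-mean, unit-variance scaling so neither objective dominates.
    return (x - x.mean()) / (x.std() + 1e-8)

# A simple multi-objective alternative: normalize each objective separately,
# then combine with explicit weights, keeping both signals on equal footing.
multi_reward = 0.5 * standardize(helpfulness) + 0.5 * standardize(safety)

print("single-objective rewards:", single_reward)
print("multi-objective rewards:", multi_reward)
```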
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, generative models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models, Data analysis, Position papers, Theory
Languages Studied: English
Submission Number: 3470