Abstract: The alignment techniques used in state-of-the-art language models (LMs), e.g., reinforcement learning from human feedback (RLHF), have driven success across many natural language processing (NLP) tasks. RLHF fine-tunes language models using human preferences, collected under guidelines of helpfulness and safety, as a *single* reward signal. However, helpfulness and safety often trade off against each other, making it difficult for a model trained toward one objective to perform well on both. This paper proposes a new alignment technique, multi-objective language model alignment (MOLMA), a framework that fine-tunes language models with *multi*-objective deep reinforcement learning. MOLMA efficiently addresses the conflicting or dominating learning signals caused by the inherent, often competing, objectives underlying language model alignment. Measured against the overall goal of being both helpful and safe, our results show that MOLMA outperforms alignment techniques that rely on single-objective deep reinforcement learning.
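To make the "dominating learning signal" issue concrete, below is a minimal, hypothetical sketch (not the paper's MOLMA algorithm) contrasting a single scalar reward with a simple multi-objective treatment in which helpfulness and safety scores are standardized separately before being combined; all array values and the `standardize` helper are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-response scores from two separate reward models
# (illustrative values only; not from the paper).
helpfulness = np.array([0.9, 0.2, 0.7])
safety = np.array([0.1, 0.95, 0.6])

# Single-objective RLHF: one scalar reward. If one objective has a larger
# scale or variance, it tends to dominate the policy-gradient signal.
single_reward = helpfulness + safety

def standardize(x):
    # Zero-mean, unit-variance scaling so neither objective dominates.
    return (x - x.mean()) / (x.std() + 1e-8)

# A simple multi-objective alternative: normalize each objective separately,
# then combine with explicit weights, keeping both signals on equal footing.
multi_reward = 0.5 * standardize(helpfulness) + 0.5 * standardize(safety)

print("single-objective rewards:", single_reward)
print("multi-objective rewards:", multi_reward)
```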
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, generative models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models, Data analysis, Position papers, Theory
Languages Studied: English
Submission Number: 3470