Aligning Language Models Using Multi-Objective Deep Reinforcement Learning

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: The alignment techniques used in state-of-the-art language models (LMs), e.g., reinforcement learning from human feedback (RLHF), have driven many successful Natural Language Processing (NLP) tasks. RLHF uses human preferences, collected under the guideline of being helpful and safe, as a single reward signal to fine-tune language models. However, the trade-off between helpfulness and safety often proves problematic, making it difficult for a model trained toward one objective to perform well on both. In this paper, we propose a new alignment technique, named multi-objective language model alignment (MOLMA). The framework is based on multi-objective deep reinforcement learning for fine-tuning language models. MOLMA efficiently addresses the issue of conflicting or dominating learning signals, which arises from the trade-offs among the inherent, often conflicting objectives underlying the language model alignment task. With respect to the overall goal of achieving both helpfulness and safety, our results show that MOLMA outperforms alignment techniques that rely on single-objective deep reinforcement learning.
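The abstract does not detail how MOLMA combines its objectives. As a purely illustrative aside, the sketch below shows one common multi-objective baseline: a normalized weighted scalarization of separate helpfulness and safety reward signals, so that neither objective's scale dominates the policy update. All function names, weights, and the scalarization scheme are assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch of multi-objective reward scalarization for RLHF-style
# fine-tuning. The names (scalarize_rewards, weights) and the weighting scheme
# are illustrative assumptions, not MOLMA's actual interface or algorithm.
import numpy as np

def scalarize_rewards(helpfulness: np.ndarray,
                      safety: np.ndarray,
                      weights: tuple = (0.5, 0.5),
                      eps: float = 1e-8) -> np.ndarray:
    """Combine per-sample helpfulness and safety rewards into a single scalar
    signal, standardizing each objective so neither dominates the update."""
    # Standardize each objective over the batch so their scales are comparable.
    h = (helpfulness - helpfulness.mean()) / (helpfulness.std() + eps)
    s = (safety - safety.mean()) / (safety.std() + eps)
    w_h, w_s = weights
    return w_h * h + w_s * s

# Example: a batch of 4 responses scored by two separate reward models.
helpfulness = np.array([2.1, 0.3, 1.7, -0.5])
safety = np.array([0.9, 1.2, -0.4, 1.0])
combined = scalarize_rewards(helpfulness, safety)
print(combined)  # one scalar reward per response, fed to the RL update
```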
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings-efficiency, Theory
Languages Studied: English