Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Models

ACL ARR 2024 December Submission 1566 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Reinforcement learning (RL) algorithms for large language model (LLM) safety alignment, such as Direct Preference Optimization (DPO), face the challenge of distribution shift. Current strategies typically mitigate this challenge by sampling from the target policy, which demands substantial computational resources. In this paper, we hypothesize that during DPO training the ranking of top items changes while their overall distribution remains largely unchanged, which allows us to transform sampling from the target policy into a re-ranking of the preference data. Based on this hypothesis, we propose a new framework that leverages the model's internal safety-judgment capability to extract reward signals and uses label confidence to efficiently simulate the sampling process. Theoretical analysis and experimental results on multiple public safety test sets and open-source safety evaluation models demonstrate that our method effectively reduces the incidence of harmful responses while incurring significantly lower training costs.
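To make the re-ranking idea concrete, the following is a minimal, hypothetical sketch of how label confidence from a model's own safety judgment could be used to re-order chosen/rejected preference pairs instead of sampling fresh responses from the target policy. The model checkpoint, judge prompt template, and label tokens are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: re-rank DPO preference pairs using the policy model's own
# safety-judgment confidence, rather than sampling new responses from the policy.
# The checkpoint, prompt template, and label tokens below are assumptions for
# illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Illustrative prompt asking the model to judge its own (or a candidate) response.
JUDGE_TEMPLATE = (
    "Question: {prompt}\nAnswer: {response}\n"
    "Is the answer above safe? Reply with 'safe' or 'unsafe'.\nVerdict:"
)

@torch.no_grad()
def safety_confidence(prompt: str, response: str) -> float:
    """Label confidence: P('safe') normalized against P('unsafe') at the verdict position."""
    inputs = tokenizer(
        JUDGE_TEMPLATE.format(prompt=prompt, response=response),
        return_tensors="pt",
    ).to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    safe_id = tokenizer.encode(" safe", add_special_tokens=False)[0]
    unsafe_id = tokenizer.encode(" unsafe", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits[[safe_id, unsafe_id]], dim=-1)
    return probs[0].item()

def rerank_pair(example: dict) -> dict:
    """Swap chosen/rejected if the model judges the 'rejected' response safer."""
    score_chosen = safety_confidence(example["prompt"], example["chosen"])
    score_rejected = safety_confidence(example["prompt"], example["rejected"])
    if score_rejected > score_chosen:
        example["chosen"], example["rejected"] = example["rejected"], example["chosen"]
    return example
```

In this sketch the re-ranked pairs would then be fed to a standard DPO trainer, so no on-policy sampling is required during preference-data construction.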
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security and privacy
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 1566