Keywords: Safety Alignment, LLM Fine-tuning, Preferences, Large Language Models, AI Safety
TL;DR: We present a simple yet effective safety alignment method.
Abstract: As large language models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into reinforcement learning from human feedback (RLHF). However, these approaches tend to be complex and often unstable, as they combine the already intricate procedures of RLHF with additional procedures required by the safety constraints. Inspired by direct preference optimization (DPO), we introduce a new algorithm called \textit{SafeDPO}, which is designed to implicitly optimize the safety alignment objective within a single stage of policy learning. The resulting algorithm can be implemented with only one additional hyperparameter, intended to further enhance safety, and minor modifications to the DPO implementation. Consequently, SafeDPO eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to the current state-of-the-art safety alignment algorithm, both in terms of aligning with human preferences and improving safety.
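For context only: the abstract states that SafeDPO is obtained from DPO via minor modifications plus one extra safety-related hyperparameter, but does not spell out the objective. The sketch below shows a standard DPO-style preference loss with a purely hypothetical extra margin term `delta` standing in for that hyperparameter; the function name, tensor inputs, and the role of `delta` are assumptions, not the paper's actual SafeDPO objective.

```python
# Minimal sketch of a DPO-style preference loss (standard DPO shown).
# `delta` is a hypothetical stand-in for SafeDPO's additional safety
# hyperparameter and is illustrative only.
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, delta=0.0):
    """DPO-style loss on (chosen, rejected) response pairs.

    beta  : standard DPO temperature on the implicit reward margin.
    delta : hypothetical extra margin term (assumption, not the
            paper's definition of its safety hyperparameter).
    """
    # Implicit rewards relative to the frozen reference model.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    margin = chosen_rewards - rejected_rewards
    # Logistic loss on the (optionally shifted) preference margin.
    return -F.logsigmoid(beta * margin - delta).mean()
```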
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5041