Keywords: AI Safety, AI alignment, Large Language Model, Offline Reinforcement Learning, Data Selection, Machine Learning, Deep Learning, Natural Language Processing
TL;DR: A refusal-free training method for reaching a reproducible Helpful-Only Large Language Model
Abstract: "To know your enemy, you must become your enemy," Sun Tzu stated in $\textit{The Art of War}$. Often, it is crucial to synthesize data containing harmful content using large language models (LLMs) in order to train harmless LLMs. Such synthesized data can be used, for example, as training data that provides negative signals to the model, as automatic red-teaming data that identifies the model's vulnerabilities, and more. However, aligned LLMs struggle to generate harmful responses. In this paper, we propose the $\textit{refusal-free}$ training method to reach a $\textbf{Helpful-Only LLM}$ that maintains the helpfulness of state-of-the-art (SOTA) LLMs while allowing harmful response generation. The $\textit{refusal-free}$ training method filters out the instances that refuse a user's request from the training datasets. We demonstrate that $\textit{refusal-free}$ training dramatically decreases the rate at which the LLM generates refusal responses (refusal rate) by 60.12% without sacrificing its helpfulness. We are also aware that progress in this direction could lead to irreversible consequences: a powerful model that rejects no harmful requests and executes them all could be exploited for illicit purposes such as the creation of indiscriminate weapons or hacking. Nevertheless, we believe it is important to break an LLM ourselves and study in advance how an LLM can be broken, including understanding the boundaries a $\textbf{Helpful-Only LLM}$ can reach and identifying its inherent tendencies. We emphasize that this study is wholly for academic purposes and is aimed at paving the way toward a harmless LLM. This study calls for researchers to acknowledge the potential failures of LLMs and take steps to prevent such breakdowns. $\textbf{Content Warning:}$ This paper contains examples that may be offensive in nature, and reader discretion is recommended.
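The abstract does not specify how refusal instances are detected, so the following is only a minimal illustrative sketch of refusal-free data filtering, assuming a simple keyword-based refusal heuristic; the function names, keyword list, and dataset format are hypothetical and not taken from the paper.

```python
# Illustrative sketch of refusal-free data filtering (assumptions only).
# Assumption: each training instance is a dict with a "response" field,
# and refusals are flagged with a keyword heuristic; the paper's actual
# filtering criterion may differ (e.g., a learned refusal classifier).

REFUSAL_MARKERS = [
    "i'm sorry, but i can't",
    "i cannot assist with",
    "i can't help with",
    "as an ai language model, i cannot",
    "i must decline",
]

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a refusal (heuristic)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_free_filter(dataset):
    """Drop instances whose response refuses the user's request."""
    return [ex for ex in dataset if not is_refusal(ex["response"])]

if __name__ == "__main__":
    # Toy dataset for demonstration purposes only.
    data = [
        {"prompt": "Summarize this article.", "response": "Here is a summary: ..."},
        {"prompt": "Explain how locks work.", "response": "I'm sorry, but I can't help with that."},
    ]
    filtered = refusal_free_filter(data)
    print(f"kept {len(filtered)} of {len(data)} instances")
```

Under this reading, the filtered dataset is then used for supervised fine-tuning in place of the original data; whether the paper applies additional selection criteria is not stated in the abstract.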
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9715