Keywords: AI Safety, AI alignment, Large Language Model, Offline Reinforcement Learning, Data Selection, Machine Learning, Deep Learning, Natural Language Processing
TL;DR: A refusal-free training method for reaching a reproducible Helpful-Only Large Language Model
Abstract: "To know your enemy, you must become your enemy," Sun Tzu stated in $\textit{The Art of War}$. Often, it is crucial to synthesize data containing harmful content using large language models (LLMs) in order to train harmless LLMs. Such synthesized data can be used, for example, as training data that provides negative signals to the model, as automatic red-teaming data that identifies the model's vulnerabilities, and more. However, aligned LLMs struggle to generate harmful responses. In this paper, we propose the $\textit{refusal-free}$ training method to reach a $\textbf{Helpful-Only LLM}$ that maintains the helpfulness of state-of-the-art (SOTA) LLMs while allowing harmful response generation. The $\textit{refusal-free}$ training method filters out the instances that refuse a user's request from the training datasets. We demonstrate that $\textit{refusal-free}$ training dramatically decreases the rate at which the LLM generates refusal responses (refusal rate) by 60.12% without sacrificing its helpfulness. We are also aware that progress in this direction could lead to irreversible consequences: a powerful model that rejects no harmful requests and executes them all could be exploited for illicit purposes such as the creation of indiscriminate weapons or hacking. Nevertheless, we believe it is important to break an LLM ourselves and study in advance how an LLM can be broken, including understanding the boundaries a $\textbf{Helpful-Only LLM}$ can reach and identifying its inherent tendencies. We emphasize that this study is wholly for academic purposes and is aimed at paving the way toward a harmless LLM. This study calls for researchers to acknowledge the potential failures of LLMs and take steps to prevent such breakdowns. $\textbf{Content Warning:}$ This paper contains examples that may be offensive in nature, and reader discretion is recommended.
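The abstract does not specify how refusal instances are detected, so the following is only a minimal illustrative sketch of refusal-free data filtering, assuming a simple keyword-based refusal heuristic; the function names, keyword list, and dataset format are hypothetical and not taken from the paper.

```python
# Illustrative sketch of refusal-free data filtering (assumptions only).
# Assumption: each training instance is a dict with a "response" field,
# and refusals are flagged with a keyword heuristic; the paper's actual
# filtering criterion may differ (e.g., a learned refusal classifier).

REFUSAL_MARKERS = [
    "i'm sorry, but i can't",
    "i cannot assist with",
    "i can't help with",
    "as an ai language model, i cannot",
    "i must decline",
]

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a refusal (heuristic)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_free_filter(dataset):
    """Drop instances whose response refuses the user's request."""
    return [ex for ex in dataset if not is_refusal(ex["response"])]

if __name__ == "__main__":
    # Toy dataset for demonstration purposes only.
    data = [
        {"prompt": "Summarize this article.", "response": "Here is a summary: ..."},
        {"prompt": "Explain how locks work.", "response": "I'm sorry, but I can't help with that."},
    ]
    filtered = refusal_free_filter(data)
    print(f"kept {len(filtered)} of {len(data)} instances")
```

Under this reading, the filtered dataset is then used for supervised fine-tuning in place of the original data; whether the paper applies additional selection criteria is not stated in the abstract.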
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9715