Abstract: "To know your enemy, you must become your enemy," Sun Tzu stated in $\textit{The Art of War}$. It is often crucial to synthesize data containing harmful content using large language models (LLMs) in order to train harmless and helpful LLMs. For instance, reinforcement learning from artificial intelligence feedback (RLAIF), one of the most widely adopted methods for aligning an LLM, requires the ability to critique harmful responses objectively, even when that means judging a harmful response to be helpful, a judgment that could itself be considered harmful depending on the context. However, an LLM aligned with a specific policy struggles to follow instructions that contradict that policy, such as tasks that require it to generate favorable assessments of responses its policy deems harmful. In this paper, we propose a \textit{refusal-free} training method that yields a $\textbf{Helpful-Only LLM (HOLLM)}$, which maintains the helpfulness of state-of-the-art (SOTA) LLMs while eliminating these limitations. Additionally, we introduce two benchmarks, (1) $\textbf{Refusal-Bench (RB)}$ and (2) $\textbf{Unsafe-Helpful-Rank (UHR)}$, to demonstrate the application of $\textbf{HOLLM}$ and evaluate its performance. We observe that $\textit{refusal-free}$ training dramatically decreases the rate at which the LLM generates refusal responses, i.e., the refusal rate (RR), by 71.59% on $\textbf{RB}$, and increases accuracy by 132.23% on $\textbf{UHR}$ without sacrificing helpfulness.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, model bias/fairness evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 6771