Building Helpful-Only Large Language Models: A Complete Approach from Motivation to Evaluation

ACL ARR 2025 May Submission 4213 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Reinforcement learning from AI feedback (RLAIF) is widely used for customizing the safety policies of large language models (LLMs) at scale. However, standard aligned LLMs are poorly suited to this setting, as their fixed alignment prevents adaptation to new policies. To address this, prior works have employed $\textbf{Helpful-Only LLMs (HOLLMs)}$. Despite their effectiveness, no public framework exists for training or evaluating HOLLMs. In this paper, we present a comprehensive framework for developing HOLLMs that enable custom safety alignment. We first define the key attributes of a HOLLM and then propose $\textbf{Refusal-Avoidant Instruction Learning (RAIL)}$, a novel training method that constructs HOLLMs from open-source datasets. We also introduce a comprehensive evaluation framework, including a new benchmark: $\textbf{Helpfulness Evaluation without Limitations from Policies (HELP)}$. Experiments show that the HOLLM achieves a 30.28\% reduction in refusal rate over the strongest refusal-optimized baseline without compromising general capabilities. The HOLLM also achieves a 29.25\% higher accuracy on HELP compared to the best-performing baseline. These results demonstrate that RAIL effectively cultivates the key attributes required of a HOLLM.
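
The abstract does not specify how the refusal-rate figure is computed. As a minimal, hypothetical sketch (not from the paper), the snippet below assumes refusal rate is the fraction of evaluation prompts a model declines and that the reported 30.28\% figure is a relative reduction against the baseline's rate; all numbers and function names are illustrative assumptions.

```python
# Hypothetical sketch: refusal rate and relative reduction.
# Assumptions (not stated in the abstract): refusal rate = fraction of
# prompts the model refuses, and the reported reduction is relative.

def refusal_rate(responses, is_refusal):
    """Fraction of responses flagged as refusals by a judge function."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_refusal(r)) / len(responses)

def relative_reduction(baseline_rate, model_rate):
    """Relative drop in refusal rate versus a baseline model."""
    return (baseline_rate - model_rate) / baseline_rate

# Made-up example: a baseline refusing 43.6% of prompts and a HOLLM
# refusing 30.4% corresponds to roughly a 30% relative reduction.
print(relative_reduction(0.436, 0.304))  # ~0.30
```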
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: fine-tuning, safety and alignment, red teaming
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Keywords: fine-tuning, safety and alignment, red teaming
Submission Number: 4213