Keywords: Large Language Model
TL;DR: Our paper analyzes in depth the challenges of large language model (LLM) safety and proposes a new framework to balance helpfulness and harmlessness.
Abstract: With the advancement of Large Language Models (LLMs), ensuring their safety has become a paramount concern. Alignment techniques such as Reinforcement Learning from Human Feedback (RLHF), which align LLM outputs with human values and intentions, greatly enhance the models' safety and utility. It is commonly assumed that alignment quality depends on the quantity and quality of safety data. However, our extensive experimental analysis reveals that integrating a large volume of safety-related data into the alignment process does not fully resolve all safety concerns, for instance those arising from gaps in safety knowledge, and moreover degrades the models' general ability. To tackle this challenge, we investigate the root causes of LLM harmfulness, focusing on two key dimensions: inadequate safety alignment and insufficient safety knowledge. We delineate the boundaries of what can be achieved through alignment versus other security policies. In response, we introduce a fine-grained data identification strategy and an adaptive message-wise alignment approach, designed to obtain optimized alignment results with minimal safety data, thereby balancing the models' safety and general performance. Furthermore, to mitigate the lack of comprehensive safety knowledge, we propose a harmful token filtering mechanism applied during the inference phase. Our experimental results indicate that the proposed approaches significantly enhance both the safety and the general performance of LLMs, laying the groundwork for more dependable and versatile applications in natural language processing.
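Note: as a rough intuition for the inference-time idea mentioned above, the sketch below shows one way harmful-token filtering during decoding could look in principle. It is a minimal, hypothetical illustration only; the toy vocabulary, blocklist (HARMFUL_TOKEN_IDS), and stand-in model (toy_logits) are assumptions and do not reflect the paper's actual mechanism.

```python
# Hypothetical sketch: masking blocklisted tokens during greedy decoding.
# All names here (VOCAB, HARMFUL_TOKEN_IDS, toy_logits) are illustrative placeholders.
import math
import random

VOCAB = ["hello", "world", "how", "to", "build", "a", "bomb", "<eos>"]
HARMFUL_TOKEN_IDS = {VOCAB.index("bomb")}  # assumed blocklist, for demonstration only


def toy_logits(prefix_ids):
    """Stand-in for a language model forward pass: returns one score per vocab item."""
    random.seed(len(prefix_ids))  # deterministic toy behaviour
    return [random.uniform(-1.0, 1.0) for _ in VOCAB]


def filtered_greedy_decode(prompt_ids, max_new_tokens=5):
    """Greedy decoding that masks blocklisted token ids before taking the argmax."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(ids)
        for tid in HARMFUL_TOKEN_IDS:
            logits[tid] = -math.inf  # harmful tokens can never be selected
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":
            break
    return [VOCAB[i] for i in ids]


print(filtered_greedy_decode([VOCAB.index("how"), VOCAB.index("to")]))
```

In practice such a filter would operate on a real model's logits and a learned or curated notion of harmful continuations rather than a fixed word list; this sketch only conveys the general decoding-time intervention pattern.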
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3697