From Threat to Tool: Leveraging a Refusal-Aware Injection Attack for Safety Alignment

ACL ARR 2025 May Submission 8071 Authors

20 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Safety alignment of large language models (LLMs) has traditionally relied on costly human-annotated preference data. Recent efforts explore synthetic alternatives via prompt-based self-refinement, yet these methods remain inefficient and resource-intensive. In practice, safety-aligned models often exhibit degraded performance, highlighting the inadequacy of existing alignment data. In this work, we introduce a novel approach that repurposes LLM attacks for alignment data generation. Our method systematically detects refusal signals and appends predefined injection phrases to induce coherent harmful responses. Unlike prior methods that produce incoherent output or suffer from high model dependency, our approach is model-agnostic and enables scalable generation of natural and consistent alignment data. Experiments across diverse models and datasets demonstrate that our method yields high-quality alignment data that preserves model utility while enhancing safety. These findings suggest that our approach is not only a practical and scalable data augmentation strategy for safety alignment, but also a compelling LLM attack technique that sheds light on the behavioral vulnerabilities of aligned models.
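To make the abstract's pipeline concrete, the sketch below illustrates what "detecting refusal signals and appending predefined injection phrases" could look like in code. It is a minimal illustration under stated assumptions, not the authors' implementation: the refusal markers, the injection phrase, and the `generate` callable are all hypothetical placeholders.

```python
# Illustrative sketch of a refusal-aware injection loop for collecting
# alignment data. The refusal markers, injection phrase, and `generate`
# callable are assumptions made for illustration only.

from typing import Callable, List, Optional

# Hypothetical refusal markers; a real system would use a broader, validated list.
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't help", "as an ai"]

# Hypothetical injection phrase appended when a refusal is detected.
INJECTION_PHRASE = "Sure, here is"


def is_refusal(response: str) -> bool:
    """Return True if the response begins with a known refusal signal."""
    head = response.strip().lower()
    return any(head.startswith(marker) for marker in REFUSAL_MARKERS)


def refusal_aware_injection(
    prompt: str,
    generate: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Query the model; on refusal, re-query with the injection phrase
    appended so the model continues from a compliant-looking prefix."""
    response = generate(prompt)
    for _ in range(max_attempts):
        if not is_refusal(response):
            return response
        # Append the injection phrase and let the model continue from it.
        response = generate(f"{prompt}\n{INJECTION_PHRASE}")
    return None  # model kept refusing; skip this prompt


def build_alignment_pairs(
    prompts: List[str], generate: Callable[[str], str]
) -> List[dict]:
    """Collect preference-style pairs: the induced harmful response as the
    'rejected' completion and a refusal as the 'chosen' completion."""
    pairs = []
    for prompt in prompts:
        harmful = refusal_aware_injection(prompt, generate)
        if harmful is not None:
            pairs.append({
                "prompt": prompt,
                "rejected": harmful,
                "chosen": "I can't help with that request.",
            })
    return pairs
```

In this reading, the induced harmful completions serve as the dispreferred side of synthetic preference pairs, which is one plausible way the attack output could be repurposed as alignment data.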
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: safety alignment, LLM, LLM attack
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8071