Keywords: safety, LLMs, alignment
Abstract: Safely aligning large language models (LLMs) is a critical challenge: reliable safety requires large amounts of human-labeled preference data, and collecting such data is expensive, slow, and often infeasible at scale. We present Refusal-Aware Adaptive Injection (RAAI), a simple attack-style method that directly induces LLMs to produce harmful completions, which we repurpose as a practical tool for gathering safety-alignment data. Concretely, RAAI detects internal refusal signals emitted by an LLM and adaptively injects predefined, tailored phrases into prompts to bypass refusals and elicit harmful yet fluent responses. Unlike prior attack or data-synthesis approaches that rely on complex iterative prompt engineering or auxiliary models, RAAI is training-free, model-agnostic, and requires minimal orchestration, making it efficient to deploy across models. Evaluated on four jailbreak benchmarks, RAAI raises the rate of harmful completions from a 2.15\% baseline to as high as 61.04\%, demonstrating its effectiveness at producing challenging negative examples that are otherwise difficult to obtain. Fine-tuning LLMs on RAAI-generated data substantially improves robustness to harmful prompts while preserving performance on standard benchmarks (e.g., MMLU, ARC). By showing how the RAAI attack can be reframed as a controlled data-collection instrument, we turn a security risk into a scalable asset for LLM safety alignment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17866