Keywords: safety, LLMs, alignment
Abstract: Safely aligning large language models (LLMs) is a critical challenge: reliable safety requires large amounts of human-labeled preference data, and collecting such data is expensive, slow, and often infeasible at scale. We present Refusal-Aware Adaptive Injection (RAAI), a simple attack-style method that directly induces LLMs to produce harmful completions, which we repurpose as a practical tool for gathering safety-alignment data. Concretely, RAAI detects internal refusal signals emitted by an LLM and adaptively injects predefined, tailored phrases into prompts to bypass refusals and elicit harmful yet fluent responses. Unlike prior attack or data-synthesis approaches that rely on complex iterative prompt engineering or auxiliary models, RAAI is training-free, model-agnostic, and requires minimal orchestration, making it efficient to deploy across models. Evaluated on four jailbreak benchmarks, RAAI raises the rate of harmful completions from a 2.15\% baseline to as high as 61.04\%, demonstrating its effectiveness at producing challenging negative examples that are otherwise difficult to obtain. Fine-tuning LLMs on RAAI-generated data substantially improves robustness to harmful prompts while preserving performance on standard benchmarks (e.g., MMLU, ARC). By showing how the RAAI attack can be reframed as a controlled data-collection instrument, we turn a security risk into a scalable asset for LLM safety alignment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17866