Mitigating Over-Refusal in Adversarial Tuning via Subspace-guided Sample Selection

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM safety
Abstract: As the adoption of large language models (LLMs) increases, their vulnerability to jailbreaks poses a significant concern. Adversarial tuning offers an effective means of enabling LLMs to resist jailbreak prompts, but it inevitably introduces the problem of over-refusal, where benign queries are mistakenly rejected, thereby compromising model utility. To address this limitation, we propose the Soft Adversarial Tuning (SAT) framework, which selects “soft samples” that balance robustness and over-refusal for adversarial tuning. Specifically, SAT decomposes the model’s hidden states into two behavioral subspaces via representation engineering: one for producing robust responses to malicious queries and another for avoiding over-refusal on benign queries. By projecting the gradients of candidate adversarial-tuning samples onto these subspaces, we quantify each sample’s influence on jailbreak defense and on over-refusal. We then select, for soft adversarial tuning, the “soft samples” that exert strong influence in the robustness subspace while having minimal effect in the over-refusal subspace. We evaluate SAT against six existing defense methods across different settings. Experimental results show that SAT consistently outperforms these methods, reducing the over-refusal rate by more than 22% while maintaining an attack success rate below 2.8% against five representative jailbreak attacks.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11594
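The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the subspaces are extracted, how gradients are represented, or how the two influences are combined. The sketch assumes each candidate sample has a flattened gradient vector, that each subspace is given as a set of linearly independent direction vectors, and that a sample's score is its projection norm in the robustness subspace minus a weighted projection norm in the over-refusal subspace. The function names and the `penalty` parameter are hypothetical.

```python
import numpy as np


def orthonormal_basis(directions):
    """Return a (d, k) orthonormal basis for the span of k direction vectors.

    Assumes the k rows of `directions` are linearly independent and k <= d.
    """
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return q


def projection_norms(grads, basis):
    """Norm of each gradient's projection onto the subspace spanned by `basis`.

    grads: (n, d) array, one gradient per candidate sample.
    basis: (d, k) array with orthonormal columns.
    """
    return np.linalg.norm(grads @ basis, axis=1)


def select_soft_samples(grads, robust_dirs, overrefusal_dirs,
                        num_select, penalty=1.0):
    """Rank candidates by an assumed score and return the top `num_select`.

    Score (an assumption, not taken from the paper):
        ||P_robust g|| - penalty * ||P_overrefusal g||
    High scores mean strong influence in the robustness subspace and
    little influence in the over-refusal subspace.
    """
    grads = np.asarray(grads, dtype=float)
    robust = projection_norms(grads, orthonormal_basis(robust_dirs))
    over = projection_norms(grads, orthonormal_basis(overrefusal_dirs))
    scores = robust - penalty * over
    order = np.argsort(-scores, kind="stable")  # descending by score
    return order[:num_select], scores


if __name__ == "__main__":
    # Toy 3-dimensional example with one direction per subspace.
    toy_grads = [[2, 0, 0], [2, 2, 0], [0, 3, 0], [1, 0, 5]]
    chosen, toy_scores = select_soft_samples(
        toy_grads, robust_dirs=[[1, 0, 0]], overrefusal_dirs=[[0, 1, 0]],
        num_select=2,
    )
    print("selected indices:", chosen)
    print("scores:", toy_scores)
```

A difference of projection norms is only one plausible way to combine the two influences; a ratio or a thresholded filter would fit the abstract's description equally well.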