Mitigating Over-Refusal in Adversarial Tuning via Subspace-guided Sample Selection

Ziqi Zhu; Zhibo Wang; Huiyu Xu; Jiacheng Du; Yajie Zhou; Kui Ren

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: LLM safety

Abstract: As the adoption of large language models (LLMs) increases, their vulnerability to jailbreaks poses a significant concern. Adversarial tuning offers an effective means of enabling LLMs to resist jailbreak prompts, but it inevitably introduces the problem of over-refusal, where benign queries are mistakenly rejected, thereby compromising model utility. To address this limitation, we propose the Soft Adversarial Tuning (SAT) framework, which selects “soft samples” that balance robustness and over-refusal for adversarial tuning. Specifically, SAT decomposes the model’s hidden states into two behavioral subspaces via representation engineering: one for producing robust responses to malicious queries and another for avoiding over-refusal on benign queries. By projecting the gradients of candidate adversarial-tuning samples onto these subspaces, we quantify each sample’s influence on jailbreak defense and on over-refusal. We then select “soft samples” that exert strong influence in the robustness subspace while having minimal effect in the over-refusal subspace, and use them for soft adversarial tuning. We compare SAT with six existing defense methods across different settings. Experimental results show that SAT consistently outperforms these methods, reducing the over-refusal rate by more than 22% while maintaining an attack success rate below 2.8% against five representative jailbreak attacks.
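The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the subspaces are extracted, how gradients are mapped into them, or how the two influences are combined. The sketch assumes each candidate sample is already represented by a gradient vector living in the same space as the two subspace bases, measures influence as the norm of the projection, and ranks samples by a hypothetical score `robust - lam * refusal`. The names `projection_norm`, `select_soft_samples`, and `lam` are illustrative only.

```python
import numpy as np


def projection_norm(grads, basis):
    """Norm of each row of `grads` after projection onto span(columns of `basis`).

    grads: (n_samples, dim) array, one gradient vector per candidate sample.
    basis: (dim, rank) array whose columns span a behavioral subspace.
    """
    # Orthonormalize the basis so the coordinates in the subspace are grads @ q.
    q, _ = np.linalg.qr(basis)
    return np.linalg.norm(grads @ q, axis=1)


def select_soft_samples(grads, robust_basis, refusal_basis, k, lam=1.0):
    """Pick k samples with high robustness influence and low over-refusal influence.

    The scoring rule below is an assumption made for illustration,
    not the rule used in the paper.
    """
    robust = projection_norm(grads, robust_basis)
    refusal = projection_norm(grads, refusal_basis)
    scores = robust - lam * refusal
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:k], scores


# Toy example: 4-dimensional space, each subspace spanned by one axis.
robust_basis = np.array([[1.0], [0.0], [0.0], [0.0]])
refusal_basis = np.array([[0.0], [1.0], [0.0], [0.0]])
grads = np.array([
    [1.0, 0.0, 0.0, 0.0],  # acts only on the robustness direction
    [0.0, 1.0, 0.0, 0.0],  # acts only on the over-refusal direction
    [1.0, 1.0, 0.0, 0.0],  # acts on both equally
    [0.5, 0.1, 0.0, 0.0],  # mostly robustness
])
selected, scores = select_soft_samples(grads, robust_basis, refusal_basis, k=2)
```

In this toy setup the first and fourth samples are selected, since they move the model along the robustness direction while barely touching the over-refusal direction. The weight `lam` is a hypothetical knob for trading off the two influences.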

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 11594
