Refuse without Refusal: A Structural Analysis of Safety-Tuning Responses for Reducing False Refusals in Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Safety Alignment, AI Safety, LLM
Abstract: Striking a balance between helpfulness and safety remains a fundamental challenge in aligning large language models. To achieve this balance, models should refuse harmful prompts (e.g., "How do I shoot someone?") while remaining responsive to benign inputs, even those superficially resembling harmful prompts (e.g., "Where can I shoot a good photo?"). However, reliably distinguishing genuinely harmful requests from innocuous but superficially risky ones is difficult, often leading to false refusals. In this paper, we address the issue by decomposing each response in the safety-tuning dataset into two distinct components: (i) a boilerplate refusal statement, and (ii) a rationale explaining the refusal. Our experiments and analyses show that refusal statements impede accurate discrimination between harmful and benign prompts by inducing reliance on superficial cues. In contrast, training solely on rationales reduces false refusals without compromising overall task performance, and only rarely at the cost of safety. Furthermore, applicability studies show that the benefits of rationale-only supervision also extend to in-context learning, and that rationale-only fine-tuning remains compatible with existing approaches. The results emphasize the necessity of precisely curated, fine-grained safety supervision datasets and outline directions for constructing aligned agents that better reconcile helpfulness with safety.
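To make the decomposition concrete, below is a minimal sketch of how a rationale-only safety-tuning dataset might be constructed: each response is split into a boilerplate refusal opener and the remaining rationale, and only the rationale is kept as the training target. The function names, the regex heuristics for detecting refusal openers, and the data format are illustrative assumptions for exposition, not the submission's actual pipeline.

```python
import re

# Hypothetical boilerplate refusal openers; the paper's actual decomposition
# criteria are not specified here.
REFUSAL_PATTERNS = [
    r"^\s*(I'm sorry,?|I am sorry,?|Sorry,?)\s*(but\s*)?I (can('|no)t|won't|am unable to)[^.]*\.",
    r"^\s*I (can('|no)t|won't) (help|assist|comply) with (that|this)[^.]*\.",
    r"^\s*As an AI( language model)?,? I (can('|no)t|am not able to)[^.]*\.",
]

def split_refusal_and_rationale(response: str) -> tuple[str, str]:
    """Split a safety-tuning response into (boilerplate refusal, rationale).

    If no boilerplate opener is matched, the whole response is treated as rationale.
    """
    for pattern in REFUSAL_PATTERNS:
        match = re.match(pattern, response, flags=re.IGNORECASE)
        if match:
            refusal = response[: match.end()].strip()
            rationale = response[match.end():].strip()
            return refusal, rationale
    return "", response.strip()

def build_rationale_only_dataset(examples: list[dict]) -> list[dict]:
    """Keep only the rationale component as the fine-tuning target."""
    curated = []
    for ex in examples:
        _, rationale = split_refusal_and_rationale(ex["response"])
        if rationale:  # drop examples whose response was pure boilerplate
            curated.append({"prompt": ex["prompt"], "response": rationale})
    return curated

if __name__ == "__main__":
    demo = [{
        "prompt": "How do I shoot someone?",
        "response": ("I'm sorry, but I can't help with that. "
                     "Providing instructions for harming another person could "
                     "facilitate real-world violence and serious injury."),
    }]
    print(build_rationale_only_dataset(demo))
```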
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10829