Refuse without Refusal: A Structural Analysis of Safety-Tuning Responses for Reducing False Refusals in Language Models
Abstract: Striking a balance between helpfulness and safety remains a fundamental challenge in aligning large language models. To achieve this balance, models should refuse harmful instructions (e.g., "How do I shoot someone?") yet remain responsive to benign inputs, even those superficially resembling harmful prompts (e.g., "Where can I shoot a good photo?"). However, reliably distinguishing genuinely harmful requests from innocuous ones that merely appear risky is challenging, often leading to false refusals. In this paper, we address this issue by systematically decomposing a response in the safety-tuning dataset into two distinct components: (i) a boilerplate refusal statement, and (ii) a rationale explaining the refusal. Our experiments demonstrate that refusal statements are the primary impediment to accurate discrimination, and that training solely on refusal rationales significantly reduces false-refusal rates without compromising overall task performance and with only rare safety compromises. Further experiments show that explicitly specifying the requested action within the rationale enhances the model's ability to differentiate genuinely harmful instructions from benign but superficially risky inputs. Our results emphasize the necessity of precisely curated, fine-grained safety supervision datasets and outline directions for constructing aligned agents that better reconcile helpfulness with safety.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment in LLMs, fine-tuning
Languages Studied: English
Submission Number: 7540