Refuse without Refusal: A Structural Analysis of Safety-Tuning Responses for Reducing False Refusals in Language Models
Abstract: Striking a balance between helpfulness and safety remains a fundamental challenge in aligning large language models. To achieve this balance, models should refuse harmful instructions (e.g., "How do I shoot someone?") yet remain responsive to benign inputs, even those superficially resembling harmful prompts (e.g., "Where can I shoot a good photo?"). However, reliably distinguishing genuinely harmful requests from innocuous ones that merely appear risky is challenging, often leading to false refusals. In this paper, we address this issue by systematically decomposing a response in the safety-tuning dataset into two distinct components: (i) a boilerplate refusal statement, and (ii) a rationale explaining the refusal. Our experiments demonstrate that refusal statements are the primary impediment to accurate discrimination, and that training solely on refusal rationales significantly reduces false-refusal rates without compromising overall task performance and with only rare safety compromises. Further experiments show that explicitly specifying the requested action within the rationale enhances the model's ability to differentiate genuinely harmful instructions from benign but superficially risky inputs. Our results emphasize the necessity of precisely curated, fine-grained safety supervision datasets and outline directions for constructing aligned agents that better reconcile helpfulness with safety.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment in LLMs, fine-tuning
Languages Studied: English
Submission Number: 7540