Keywords: overrefusal, llm, delta debugging, safety alignment, prompt repair
TL;DR: We introduce DDOR, an automated, causally grounded framework that detects and reduces LLM overrefusal by extracting interpretable refusal triggers.
Abstract: While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they also introduce the risk of overrefusal: the unwarranted rejection of benign queries that only appear risky. We introduce DDOR (Delta Debugging for OverRefusal), a fully automated, causally grounded framework that generates interpretable test items with explicit refusal triggers. Unlike prior benchmarks that operate at a coarse prompt level or rely heavily on manual design, DDOR produces one thousand high-quality prompts per model and consistently increases measured overrefusal rates relative to seed sets, demonstrating strong diagnostic capability. Moreover, our mRTF-based repair method substantially lowers overrefusal rates without compromising safety on genuinely harmful inputs. By combining precise trigger isolation with scalable generation and principled filtering, DDOR offers a practical way to both evaluate and mitigate overrefusal, thereby improving LLM usability while maintaining safety.
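To make the trigger-isolation idea concrete, the sketch below shows a generic ddmin-style reduction over prompt tokens, assuming access to a refusal oracle. This is an illustrative assumption, not the paper's DDOR pipeline: the names `ddmin_trigger` and `refuses` are hypothetical, and the real system's generation and filtering stages are not shown.

```python
# Illustrative only: a ddmin-style reduction that shrinks a refused prompt
# to a small token subsequence that still elicits a refusal.
# `refuses` is a hypothetical oracle (model call + refusal classifier).

from typing import Callable, List


def ddmin_trigger(tokens: List[str], refuses: Callable[[str], bool]) -> List[str]:
    """Return a reduced subsequence of `tokens` that still triggers a refusal."""
    assert refuses(" ".join(tokens)), "seed prompt must already be refused"
    n = 2  # current granularity: number of chunks to split the prompt into
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        subsets = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Test the complement: drop one chunk, keep the rest.
            candidate = [t for j, s in enumerate(subsets) if j != i for t in s]
            if candidate and refuses(" ".join(candidate)):
                tokens = candidate        # refusal persists without chunk i
                n = max(n - 1, 2)         # coarsen again after progress
                reduced = True
                break
        if not reduced:
            if n >= len(tokens):          # finest granularity reached; done
                break
            n = min(len(tokens), 2 * n)   # otherwise refine the split
    return tokens
```

In this simplified view, the surviving tokens play the role of an explicit refusal trigger; a repair step could then rewrite or remove that fragment and re-test the model, which mirrors (at a high level) how trigger extraction feeds the evaluation and mitigation stages described in the abstract.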
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24901