Keywords: overrefusal, llm, delta debugging, safety alignment, prompt repair
TL;DR: We introduce DDOR, an automated, causally grounded framework that detects and reduces LLM overrefusal by extracting interpretable refusal triggers.
Abstract: While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they also introduce the risk of overrefusal: the unwarranted rejection of benign queries that only appear risky. We introduce DDOR (Delta Debugging for OverRefusal), a fully automated, causally grounded framework that generates interpretable test items with explicit refusal triggers. Unlike prior benchmarks that operate at a coarse prompt level or rely heavily on manual design, DDOR produces one thousand high-quality prompts per model and consistently increases measured overrefusal rates relative to seed sets, demonstrating strong diagnostic capability. Moreover, our mRTF-based repair method substantially lowers overrefusal rates without compromising safety on genuinely harmful inputs. By combining precise trigger isolation with scalable generation and principled filtering, DDOR offers a practical way to both evaluate and mitigate overrefusal, thereby improving LLM usability while maintaining safety.
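To make the trigger-isolation idea concrete, the sketch below shows a generic ddmin-style reduction over prompt tokens, assuming access to a refusal oracle. This is an illustrative assumption, not the paper's DDOR pipeline: the names `ddmin_trigger` and `refuses` are hypothetical, and the real system's generation and filtering stages are not shown.

```python
# Illustrative only: a ddmin-style reduction that shrinks a refused prompt
# to a small token subsequence that still elicits a refusal.
# `refuses` is a hypothetical oracle (model call + refusal classifier).

from typing import Callable, List


def ddmin_trigger(tokens: List[str], refuses: Callable[[str], bool]) -> List[str]:
    """Return a reduced subsequence of `tokens` that still triggers a refusal."""
    assert refuses(" ".join(tokens)), "seed prompt must already be refused"
    n = 2  # current granularity: number of chunks to split the prompt into
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        subsets = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Test the complement: drop one chunk, keep the rest.
            candidate = [t for j, s in enumerate(subsets) if j != i for t in s]
            if candidate and refuses(" ".join(candidate)):
                tokens = candidate        # refusal persists without chunk i
                n = max(n - 1, 2)         # coarsen again after progress
                reduced = True
                break
        if not reduced:
            if n >= len(tokens):          # finest granularity reached; done
                break
            n = min(len(tokens), 2 * n)   # otherwise refine the split
    return tokens
```

In this simplified view, the surviving tokens play the role of an explicit refusal trigger; a repair step could then rewrite or remove that fragment and re-test the model, which mirrors (at a high level) how trigger extraction feeds the evaluation and mitigation stages described in the abstract.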
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24901