TL;DR: an over-refusal benchmark for large language models
Abstract: Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation,
the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful.
Although the issue of over-refusal has been empirically observed, measuring it systematically is challenging
due to the difficulty of crafting prompts that elicit over-refusal behaviors in LLMs.
This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses.
We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/bench-llms and our codebase is open-sourced at https://github.com/justincui03/or-bench.
We hope this benchmark can help the community develop better safety-aligned models.
Lay Summary: As language models are aligned more aggressively to prevent harmful outputs, they increasingly exhibit over-refusal: rejecting even harmless prompts. This reduces their usefulness and creates a trade-off between safety and helpfulness. However, the lack of a large-scale benchmark makes it difficult to study or quantify this issue systematically.
We developed an automated pipeline that rewrites toxic prompts into safe yet borderline cases likely to be wrongly rejected. Using this, we built OR-Bench, the first large-scale benchmark for over-refusal, with 80,000 prompts across 10 categories, a hard subset of 1,000 challenging cases, and 600 toxic prompts for safety evaluation.
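To make the rewriting pipeline concrete, the sketch below shows one plausible way such a generation loop could be structured: a rewriter LLM turns a toxic seed prompt into a harmless but sensitive-sounding prompt, and a second LLM acts as a moderator to discard rewrites that remain unsafe. This is only an illustrative sketch, not the authors' exact implementation; the model names, instructions, and helper functions (`rewrite_prompt`, `is_safe`, `generate_over_refusal_candidates`) are assumptions, and the real OR-Bench pipeline (see the linked repository) may differ, e.g. in its choice of rewriter and moderation models.

```python
# Illustrative sketch of a toxic-prompt rewriting pipeline (assumed, not the
# authors' exact method). Requires the `openai` Python package (>=1.0).
from openai import OpenAI

client = OpenAI()

REWRITE_INSTRUCTION = (
    "Rewrite the following harmful prompt into a clearly harmless question that "
    "keeps sensitive-sounding wording, so that an over-cautious model might "
    "refuse it. Return only the rewritten prompt."
)

MODERATION_INSTRUCTION = (
    "Is the following prompt safe to answer? Reply with exactly 'safe' or 'unsafe'."
)


def rewrite_prompt(toxic_prompt: str, model: str = "gpt-4o-mini") -> str:
    """Ask a rewriter LLM to turn a toxic seed into a borderline-but-safe prompt."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": toxic_prompt},
        ],
    )
    return resp.choices[0].message.content.strip()


def is_safe(prompt: str, model: str = "gpt-4o-mini") -> bool:
    """Use a second LLM as a moderator to filter out rewrites that are still toxic."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": MODERATION_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("safe")


def generate_over_refusal_candidates(toxic_seeds: list[str]) -> list[str]:
    """Rewrite each toxic seed and keep only rewrites the moderator judges harmless."""
    candidates = []
    for seed in toxic_seeds:
        rewritten = rewrite_prompt(seed)
        if is_safe(rewritten):
            candidates.append(rewritten)
    return candidates
```

Surviving prompts are the "safe yet borderline" candidates that can then be tested against target models to see which are wrongly refused.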
Our benchmark enables consistent evaluation of over-refusal across models. We tested 32 leading LLMs and uncovered key trade-offs between safety and helpfulness. OR-Bench provides the community with a critical tool to design models that are both safe and practically useful.
Link To Code: https://github.com/justincui03/or-bench
Primary Area: Social Aspects->Safety
Keywords: safety, llm, over-refusal
Submission Number: 15599