Abstract: We introduce the concept of context-driven over-refusal, an abstention that arises when a model's safety guardrails are triggered by the grounding knowledge provided alongside the user's request. Distinct from question-driven over-refusal, this behavior occurs both in retrieval-augmented generation (RAG) and in natural language processing (NLP) task completion (e.g., summarization, translation), where external content can unexpectedly trigger refusals. In this work, we present a novel two-stage evaluation framework named COVER, designed to quantify and analyze this behavior. Through a comprehensive empirical study on two public corpora, we show that over-refusal rates strongly depend on the task, system prompt, model family, and number of retrieved documents. We observe that tasks such as translation and summarization yield disproportionately high over-refusal rates, while question answering remains relatively robust, especially in newer models. Moreover, increasing the number of contextual documents tends to reduce refusals, yet broadens the pool of prompts at risk of encountering at least one "unsafe" text. Interestingly, strict system prompts do not necessarily lead to higher over-refusal rates, suggesting that in the absence of explicit directives, some models may default to more cautious behavior. These findings highlight the need for fine-grained alignment and benchmarking strategies sensitive to both user intent and contextual nuances, offering a roadmap for future research in model training and evaluation.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: model bias/fairness evaluation; ethical considerations in NLP applications
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7673