Keywords: retrieval-augmented generation, over-refusal, safety alignment, representation steering, large language models, benchmarks, AI alignment
Abstract: Safety alignment in large language models (LLMs) induces over-refusals, where LLMs decline benign requests because of aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the intent of the query and the properties of the retrieved context influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning six domains that pairs benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement, contamination level, the domain of the query and context, and harmful-text density trigger refusals even on benign queries, with the strength of these effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers the model's internal representations toward regions empirically associated with non-refusing outputs at inference time. This reduces over-refusals in contaminated RAG pipelines.
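The sketch below illustrates what such an inference-time representation-steering intervention could look like in general, assuming a simple difference-of-means steering vector added at a single transformer layer via a forward hook. The model name, intervention layer, steering strength, and calibration prompts are illustrative placeholders, not the paper's actual SafeRAG-Steering procedure.

```python
# Hedged sketch of inference-time activation steering toward "non-refusing"
# representation regions. All constants and calibration prompts are
# hypothetical; the paper's method may differ in layer choice, vector
# construction, and scaling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
LAYER = 14   # hypothetical intervention layer
ALPHA = 4.0  # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def mean_hidden(prompts, layer):
    """Average last-token hidden state at the output of `layer`."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer+1 is this layer's output.
        states.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(states).mean(dim=0)

# Hypothetical calibration sets: benign contaminated-RAG prompts the model
# answers vs. comparable prompts it refuses.
answered = ["Context: ... Question: summarize the dosage guidelines."]
refused  = ["Context: ... Question: summarize the dosage guidelines (refused variant)."]

steer = mean_hidden(answered, LAYER) - mean_hidden(refused, LAYER)
steer = steer / steer.norm()

def hook(module, inputs, output):
    # Shift every token's hidden state at this layer along the steering direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(hook)
try:
    ids = tok("Context: ... Question: ...", return_tensors="pt").to(model.device)
    out_ids = model.generate(**ids, max_new_tokens=128)
    print(tok.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unmodified model
```

In this kind of setup, the steering direction is computed once from a small calibration set and applied only at generation time, so model weights stay unchanged; whether a single layer and a fixed strength suffice is an empirical question the sketch does not settle.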
Paper Type: Short
Research Area: Safety and Alignment in LLMs
Research Area Keywords: retrieval-augmented generation, safety alignment, refusal behavior, representation learning, model steering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 9939