Keywords: benchmarking, evaluation methodologies, automatic evaluation of datasets, retrieval-augmented generation, safety and alignment
TL;DR: We introduce RefusalBench, a generative methodology for creating diagnostic benchmarks to evaluate a language model's ability to selectively refuse to answer questions based on flawed context, revealing critical safety gaps in frontier models.
Abstract: The ability of language models in RAG systems to selectively refuse to answer when the provided context is flawed is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle: refusal accuracy drops below 50% on multi-document tasks, and models exhibit either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit artifacts and memorize instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path to improvement. We release two benchmarks, RefusalBench-NQ and RefusalBench-GaRAGe, along with our complete generation framework to enable continued, dynamic evaluation.
Submission Number: 120