Keywords: Large Language Models, Benchmark, Evaluation, Uncertainty Expression, Confidence Estimation, Retrieval-Augmented Generation, Overconfidence
TL;DR: BLUFF-1000 benchmarks how LLMs modulate linguistic confidence under imperfect retrieval and shows that most state-of-the-art LLMs fail to adjust their certainty when evidence quality degrades.
Abstract: Retrieval-augmented generation (RAG) systems often fail to adequately modulate their linguistic certainty when evidence deteriorates. This gap in how models respond to imperfect retrieval undermines the safety and reliability of real-world RAG systems. To address it, we propose BLUFF-1000, a benchmark systematically designed to evaluate how large language models (LLMs) manage linguistic confidence under conflicting evidence that simulates poor retrieval. We construct a new dataset, introduce two novel metrics, and compute a comprehensive set of metrics quantifying faithfulness, factuality, linguistic uncertainty, and calibration. Finally, we run controlled experiments on the generation component of RAG systems across seven LLMs, measuring their uncertainty awareness and overall performance. While not definitive, our observations reveal initial indications of a misalignment between expressed uncertainty and source quality across these seven state-of-the-art systems, underscoring the value of continued benchmarking in this space. We recommend that future RAG systems adopt and refine uncertainty-aware methods that convey confidence transparently throughout the pipeline.
Submission Number: 63