SciRAGBench: Benchmarking Large Language Models for Retrieval-Augmented Generation in Scientific Domains

ACL ARR 2025 February Submission 7843 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by integrating external knowledge, which is particularly crucial in scientific domains that demand precision and up-to-date information. However, there is currently no comprehensive framework for systematically evaluating RAG in these specialized contexts, as most existing benchmarks focus on general domains and overlook the complexities of scientific data. To address this gap, we propose SciRAGBench, the first benchmark designed to assess the RAG capabilities of LLMs in scientific contexts. It comprises ten datasets spanning diverse scientific domains and incorporates structured tables, knowledge graphs, and unstructured text as external knowledge sources. SciRAGBench systematically assesses four key competencies, noise robustness, negative rejection, information integration, and reasoning, across diverse question formats. Through extensive evaluation of state-of-the-art LLMs on SciRAGBench, we benchmark their capabilities along these four dimensions, revealing their limitations in processing varied forms of scientific data.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, automatic creation and evaluation of language resources, benchmarking
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 7843