A Comprehensive Evaluation of Large Language Models for Retrieval-Augmented Generation under Noisy Conditions
Abstract: Retrieval-Augmented Generation (RAG) has emerged as an effective strategy for grounding Large Language Models (LLMs) in reliable, up-to-date information. This paper investigates the trade-off between cost and performance by evaluating 13 LLMs within a RAG pipeline for the Question Answering (Q&A) task under noisy retrieval conditions. We assess four extractive and nine generative models—spanning both open- and closed-source models of varying sizes—on a journalistic benchmark specifically designed for RAG. By systematically varying the level of noise injected into the retrieved context, we analyze not only which models perform best, but also how robust they are to noisy input. Results show that large open-source generative models (approx. 70B parameters) achieve performance and noise tolerance on par with top-tier closed-source models. However, their computational demands limit their practicality in resource-constrained settings. In contrast, medium-sized open-source models (approx. 7B parameters) emerge as a compelling compromise, balancing efficiency, robustness, and accessibility.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: applications, prompting, retrieval-augmented generation, robustness
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 3
B2 Discuss The License For Artifacts: No
B2 Elaboration: We cite the reference papers for the models and the dataset. The models' licenses are available in their Hugging Face repositories, and the dataset's license is available in its GitHub repository.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 3
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 3
C3 Descriptive Statistics: Yes
C3 Elaboration: 4
C4 Parameters For Packages: Yes
C4 Elaboration: 3
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Yes, we used AI writing assistance only for translation.
Author Submission Checklist: Yes
Submission Number: 906