Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

ACL ARR 2025 May Submission 2056 Authors

18 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent advances in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing benchmarks often assume a single notion of optimal information use, neglecting diverse user needs in which 'correctness' can mean faithfulness to instructed sources rather than factual recall. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases (Context-Exclusive, Context-First, and Memory-First) across three distinct context settings: Context Matching, Knowledge Conflict, and Information Irrelevant. By varying both the user instruction and the nature of the retrieved information, our approach captures the complexities of real-world applications in which models must adapt to diverse user requirements. Through extensive experiments on multiple QA datasets, including HotpotQA, DisentQA, and our synthetic URAQ dataset, we find that restricting memory usage improves robustness under adversarial retrieval conditions but lowers peak performance when retrieval is ideal, and that model family dominates behavioral differences. Our findings highlight the necessity of user-centric evaluation in the development of retrieval-augmented systems and provide insights into optimizing model performance across varied retrieval contexts, explicitly separating factual correctness from faithfulness-to-instruction so that readers know which dimension each score reflects. We will release our code and the URAQ dataset upon acceptance.
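The framework's core is a 3x3 design that crosses the three user need cases with the three context settings. As a minimal sketch of how such an evaluation grid might be enumerated (the instruction wording, helper names, and dictionary keys below are hypothetical illustrations, not the authors' released prompts or code):

```python
from itertools import product

# Hypothetical instruction prefixes for the three user need cases;
# the exact wording is an assumption, not taken from the paper.
USER_NEEDS = {
    "context_exclusive": "Answer using ONLY the provided context.",
    "context_first": "Prefer the provided context; use your own knowledge only if it is insufficient.",
    "memory_first": "Prefer your own knowledge; consult the context only if necessary.",
}

# The three context settings vary what is retrieved for a given question.
CONTEXT_SETTINGS = ["context_matching", "knowledge_conflict", "information_irrelevant"]


def build_eval_grid(question: str, contexts: dict[str, str]) -> list[dict]:
    """Enumerate the 3x3 (user need x context setting) evaluation conditions.

    `contexts` maps each context setting to a retrieved passage for `question`.
    """
    grid = []
    for need, setting in product(USER_NEEDS, CONTEXT_SETTINGS):
        prompt = (
            f"{USER_NEEDS[need]}\n\n"
            f"Context: {contexts[setting]}\n\n"
            f"Question: {question}"
        )
        grid.append({"user_need": need, "context_setting": setting, "prompt": prompt})
    return grid
```

Each of the nine cells can then be scored twice, once for factual correctness and once for faithfulness to the instructed source, reflecting the separation of the two dimensions described in the abstract.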
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Retrieval-Augmented Generation, Interpretability and Analysis of Models for NLP, Faithfulness, Resources and Evaluation, Question Answering, Information Retrieval and Text Mining, Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2056