Keywords: Retrieval-Augmented Generation, Large Language Models, Evidence Aggregation, Knowledge Conflicts, Illusory Truth Effect, Primacy Bias, RAG Systems, Heuristic Reasoning, Model Robustness, Question Answering
Abstract: Retrieval-Augmented Generation (RAG) is the prevailing paradigm for grounding Large Language Models (LLMs), yet the mechanisms governing $\textit{how}$ models integrate groups of conflicting retrieved evidence remain opaque. Does an LLM answer a certain way because the evidence is factually strong, because of a prior belief, or merely because a claim is repeated frequently? To answer this, we introduce $\textbf{GroupQA}$, a curated dataset of 1,635 controversial questions paired with 15,058 diversely-sourced evidence documents, annotated for stance and qualitative strength. Through controlled experiments, we characterize group-level evidence aggregation dynamics: paraphrasing an argument can be more persuasive than providing distinct, independent supporting evidence; models favor evidence presented first rather than last; and larger models are increasingly resistant to adapting to the presented evidence. Additionally, we find that the explanations LLMs give for their group-based answers are unfaithful. Together, these findings show that LLMs consistently behave as vulnerable heuristic followers, with direct implications for improving RAG system design.
Paper Type: Long
Research Area: Retrieval-Augmented Language Models
Research Area Keywords: Question Answering, Interpretability and Analysis of Models for NLP, Retrieval-Augmented Language Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8485