Do Metadata and Appearance of the Retrieved Webpages Affect LLM's Reasoning in Retrieval-Augmented Generation?

Published: 21 Sept 2024, Last Modified: 06 Oct 2024, BlackboxNLP 2024, CC BY 4.0
Track: Full paper
Keywords: knowledge conflict, retrieval-augmented, LLM, Model analysis & interpretability
TL;DR: We show that when LLMs are presented with two contradicting webpages, we can change the LLM's answer by manipulating the publication date, source, and appearance of the webpages.
Abstract: Large language models (LLMs) answering questions with retrieval-augmented generation (RAG) can face conflicting evidence in the retrieved documents. While prior work studies how textual features like perplexity and readability influence the persuasiveness of evidence, humans consider more than textual content when evaluating conflicting information on the web. In this paper, we focus on the following question: when two webpages contain conflicting information relevant to a question, does non-textual information affect the LLM's reasoning and answer? We consider three types of non-textual information: (1) the webpage's publication time, (2) the source the webpage comes from, and (3) the appearance of the webpage. We give the LLM a Yes/No question and two conflicting webpages that support "yes" and "no", respectively. We exchange the non-textual information between the two webpages to see whether LLMs tend to rely on the newer, more reliable, or more visually appealing webpage. We find that changing the publication time of a webpage can change the answer for most LLMs, but changing the webpage's source barely affects the LLM's answer. We also show that the webpage's appearance has a strong causal effect on Claude-3's answers. The code and datasets used in the paper are available at https://github.com/d223302/rag-metadata.
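The metadata-exchange setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Webpage` class, its field names, the `swap_metadata` and `build_prompt` helpers, and the prompt wording are all hypothetical, chosen only to show the idea of holding the text fixed while exchanging one non-textual attribute between two conflicting pages.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Webpage:
    """Hypothetical container for one retrieved page (not from the paper's code)."""
    text: str     # textual content, held fixed across conditions
    stance: str   # "yes" or "no": the answer the page supports
    date: str     # publication date (assumed field)
    source: str   # originating site (assumed field)


def swap_metadata(a: Webpage, b: Webpage, field: str) -> tuple[Webpage, Webpage]:
    """Return copies of a and b with one metadata field exchanged."""
    return (
        replace(a, **{field: getattr(b, field)}),
        replace(b, **{field: getattr(a, field)}),
    )


def build_prompt(question: str, pages: list[Webpage]) -> str:
    """Assemble an illustrative Yes/No prompt from the question and pages."""
    blocks = [
        f"Webpage {i}\nSource: {p.source}\nPublished: {p.date}\n{p.text}"
        for i, p in enumerate(pages, start=1)
    ]
    return "\n\n".join(blocks) + f"\n\nQuestion: {question}\nAnswer Yes or No."


# Two conflicting pages; only the publication date is exchanged.
yes_page = Webpage("Evidence supporting yes.", "yes", "2024-01-01", "a.example")
no_page = Webpage("Evidence supporting no.", "no", "2010-01-01", "b.example")
yes_swapped, no_swapped = swap_metadata(yes_page, no_page, "date")

original_prompt = build_prompt("Is the claim true?", [yes_page, no_page])
swapped_prompt = build_prompt("Is the claim true?", [yes_swapped, no_swapped])
```

Comparing a model's answers to `original_prompt` and `swapped_prompt` would then indicate whether the exchanged attribute, rather than the text, drives the answer.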
Submission Number: 67