Neural Retrievers are Biased Towards LLM-Generated Content

Published: 05 Mar 2024, Last Modified: 22 May 2024
Venue: ICLR 2024 AGI Workshop Poster
License: CC BY 4.0
Keywords: Information Retrieval, LLM-Generated Texts, Artificial Intelligence Generated Content
TL;DR: This paper uncovers and analyzes the bias of neural retrievers towards LLM-generated texts, and further proposes a debiasing method to mitigate it.
Abstract: Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search, by generating vast amounts of human-like text on the Internet. As a result, IR systems in the LLM era face a new challenge: the indexed documents are no longer written only by human beings but are also automatically generated by LLMs. How these LLM-generated documents influence IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of bias in neural retrievers towards LLM-generated text as **source bias**. Moreover, we discover that this bias is not confined to first-stage neural retrievers but extends to second-stage neural re-rankers. In-depth analyses from the perspective of text compression then indicate that LLM-generated texts exhibit more focused semantics with less noise, making them easier for neural retrievers to match semantically. To mitigate source bias, we also propose a plug-and-play debiasing constraint on the optimization objective, and experimental results show its effectiveness. Finally, we discuss the potentially severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future exploration of IR in the LLM era, the two newly constructed benchmarks and the code are available at https://github.com/KID-22/LLM4IR-Bias.
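One simple way to probe the text-compression intuition described above: if LLM-generated texts carry more focused semantics with less noise, a generic compressor should squeeze them into fewer bits per character than their human-written counterparts. The sketch below is an illustrative proxy only, not the paper's exact analysis; the placeholder documents and the `compression_ratio` helper are hypothetical.

```python
# Illustrative probe of the compression intuition (not the paper's method):
# compare how well a generic compressor shrinks a human-written document
# versus its LLM-rewritten counterpart.
import gzip

def compression_ratio(text: str) -> float:
    """Compressed size / raw size; lower means more regularity and redundancy."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

# Hypothetical inputs: a human-written document and its LLM-rewritten version.
human_text = "A human-written passage from the benchmark corpus ..."
llm_text = "The same passage rewritten by an LLM ..."

print(f"human: {compression_ratio(human_text):.3f}")
print(f"llm:   {compression_ratio(llm_text):.3f}")
```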
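The abstract does not spell out the form of the plug-and-play debiasing constraint, so the following is a minimal sketch under one plausible reading: a pairwise penalty, added to a standard contrastive ranking loss, that discourages the retriever from scoring an LLM-generated copy above its human-written counterpart. All names and the `alpha` hyperparameter are illustrative assumptions, not the paper's actual formulation.

```python
# A minimal sketch of a plug-and-play debiasing constraint, assuming a
# pairwise penalty added to a standard InfoNCE-style ranking loss.
import torch
import torch.nn.functional as F

def ranking_loss_with_debias(q_emb, human_doc_emb, llm_doc_emb, neg_doc_emb, alpha=0.1):
    """Contrastive ranking loss plus a hypothetical debiasing term.

    q_emb:          (B, d) query embeddings
    human_doc_emb:  (B, d) human-written relevant documents
    llm_doc_emb:    (B, d) LLM-rewritten versions of the same documents
    neg_doc_emb:    (B, N, d) irrelevant (negative) documents
    alpha:          weight of the debiasing term (assumed hyperparameter)
    """
    # Relevance scores via dot product, as in common dense retrievers.
    s_human = (q_emb * human_doc_emb).sum(-1)                  # (B,)
    s_llm = (q_emb * llm_doc_emb).sum(-1)                      # (B,)
    s_neg = torch.einsum("bd,bnd->bn", q_emb, neg_doc_emb)     # (B, N)

    # Standard contrastive loss: human-written positive vs. negatives.
    logits = torch.cat([s_human.unsqueeze(1), s_neg], dim=1)   # (B, 1+N)
    targets = torch.zeros(q_emb.size(0), dtype=torch.long, device=q_emb.device)
    l_rank = F.cross_entropy(logits, targets)

    # Debiasing constraint: penalize cases where the LLM-generated version
    # outranks the semantically equivalent human-written one.
    l_debias = F.relu(s_llm - s_human).mean()

    return l_rank + alpha * l_debias
```

Because the constraint only adds a term to the loss, it can in principle be bolted onto any trainable retriever or re-ranker without architectural changes, which is presumably what "plug-and-play" refers to.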
Submission Number: 29