LLM-Augmented Relevance Feedback: Generative Feedback with Automatic LLM Judges for Conversational Search
Abstract: Recent research on Large Language Model (LLM) judges in search largely focuses on their role as offline evaluators. Instead, this paper investigates LLMs in a role closer to simulation, using them as proxies for human relevance feedback. We present LLM-Augmented Relevance Feedback (LARF), which synthesises recent LLM judge methods with two relevance feedback integration approaches, Query Reformulation and Query-by-Document, to improve the set of candidate documents. We perform experiments on standard conversational search benchmarks, TREC iKAT and CAsT. Our work addresses three research questions: (1) What is the retrieval benefit when LARF is used with human feedback? (2) How does noise in relevance judgements impact downstream feedback effectiveness? (3) What issues arise with current LLM judges when used with LARF? We find that with human judgements, Query-by-Document achieves new state-of-the-art results, significantly outperforming previous work (48% nDCG@3 on CAsT). We further study how effectiveness degrades as judgements become noisier, and when using current automatic LLM judges, we find an 18% nDCG@3 gain over the previous state of the art on CAsT. We conclude that LARF offers a new and effective mechanism for improving retrieval quality in conversational search and highlight the need to reduce judgement noise, particularly for complex personalised tasks.
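To make the two feedback integration modes named in the abstract concrete, the sketch below shows one minimal way Query Reformulation and Query-by-Document could consume LLM-judge labels. This is not the paper's implementation: the function names are hypothetical, and the LLM judge is stood in for by a simple term-overlap placeholder (a real system would prompt an LLM for graded relevance labels).

from collections import Counter
from typing import Dict, List


def judge_relevance(query: str, doc: str) -> int:
    # Placeholder for an automatic LLM judge producing a graded label (0-3).
    # Approximated here by query-document term overlap, for illustration only.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return min(len(q_terms & d_terms), 3)


def query_reformulation(query: str, judged: Dict[str, int], k_terms: int = 5) -> str:
    # Query Reformulation: expand the query with frequent terms drawn from
    # documents the judge labelled relevant (label >= 1).
    q_terms = set(query.lower().split())
    counts = Counter()
    for doc, label in judged.items():
        if label >= 1:
            counts.update(t for t in doc.lower().split() if t not in q_terms)
    expansion = [t for t, _ in counts.most_common(k_terms)]
    return query + " " + " ".join(expansion)


def query_by_document(judged: Dict[str, int], min_label: int = 2) -> List[str]:
    # Query-by-Document: use judged-relevant documents themselves as queries;
    # their rankings would then be fused downstream (e.g. reciprocal rank fusion).
    return [doc for doc, label in judged.items() if label >= min_label]


if __name__ == "__main__":
    query = "effects of caffeine on sleep quality"
    candidates = [
        "caffeine late in the day reduces sleep quality and duration",
        "the history of coffee cultivation in ethiopia",
        "sleep quality is affected by caffeine and screen time",
    ]
    judged = {doc: judge_relevance(query, doc) for doc in candidates}
    print(query_reformulation(query, judged))
    print(query_by_document(judged))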
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Information Retrieval, Relevance Feedback, User Simulation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2199