LLM-Augmented Relevance Feedback: Generative Feedback with Automatic LLM Judges for Conversational Search
Abstract: Recent research on Large Language Model (LLM) judges in search largely focuses on their role as offline evaluators. Instead, this paper investigates LLMs in a role closer to simulation, using them as proxies for human relevance feedback. We present LLM-Augmented Relevance Feedback (LARF), which synthesises recent LLM judge methods with two relevance feedback integration approaches, Query Reformulation and Query-by-Document, to improve the set of candidate documents. We perform experiments on standard conversational search benchmarks, TREC iKAT and CAsT. Our work addresses three research questions: (1) What is the retrieval benefit when LARF is used with human feedback? (2) How does noise in relevance judgements impact downstream feedback effectiveness? (3) What issues arise with current LLM judges when used with LARF? We find that with human judgements, Query-by-Document achieves new state-of-the-art results, significantly outperforming previous work (48% nDCG@3 on CAsT). We further study how effectiveness degrades as judgements become noisier, and when using current automatic LLM judges, we find an 18% nDCG@3 gain over the previous state of the art on CAsT. We conclude that LARF offers a new and effective mechanism for improving retrieval quality in conversational search and highlight the need to reduce judgement noise, particularly for complex personalised tasks.
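To make the two feedback integration modes named in the abstract concrete, the sketch below shows one minimal way Query Reformulation and Query-by-Document could consume LLM-judge labels. This is not the paper's implementation: the function names are hypothetical, and the LLM judge is stood in for by a simple term-overlap placeholder (a real system would prompt an LLM for graded relevance labels).

from collections import Counter
from typing import Dict, List


def judge_relevance(query: str, doc: str) -> int:
    # Placeholder for an automatic LLM judge producing a graded label (0-3).
    # Approximated here by query-document term overlap, for illustration only.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return min(len(q_terms & d_terms), 3)


def query_reformulation(query: str, judged: Dict[str, int], k_terms: int = 5) -> str:
    # Query Reformulation: expand the query with frequent terms drawn from
    # documents the judge labelled relevant (label >= 1).
    q_terms = set(query.lower().split())
    counts = Counter()
    for doc, label in judged.items():
        if label >= 1:
            counts.update(t for t in doc.lower().split() if t not in q_terms)
    expansion = [t for t, _ in counts.most_common(k_terms)]
    return query + " " + " ".join(expansion)


def query_by_document(judged: Dict[str, int], min_label: int = 2) -> List[str]:
    # Query-by-Document: use judged-relevant documents themselves as queries;
    # their rankings would then be fused downstream (e.g. reciprocal rank fusion).
    return [doc for doc, label in judged.items() if label >= min_label]


if __name__ == "__main__":
    query = "effects of caffeine on sleep quality"
    candidates = [
        "caffeine late in the day reduces sleep quality and duration",
        "the history of coffee cultivation in ethiopia",
        "sleep quality is affected by caffeine and screen time",
    ]
    judged = {doc: judge_relevance(query, doc) for doc in candidates}
    print(query_reformulation(query, judged))
    print(query_by_document(judged))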
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Information Retrieval, Relevance Feedback, User Simulation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2199