Can we trust LLMs as relevance judges?

Published: 10 Oct 2024 · Last Modified: 06 Feb 2025 · Proceedings of the XXXIX Brazilian Symposium on Databases · CC BY 4.0
Abstract: Evaluation is key for Information Retrieval systems and requires test collections consisting of documents, queries, and relevance judgments. Obtaining relevance judgments is the most costly step in creating test collections because it demands human intervention. A recent tendency in the area is to replace humans with Large Language Models (LLMs) as the source of relevance judgments. In this paper, we investigate how reliable LLMs are in this task, experimenting with different LLMs and test collections in Portuguese. Our results show that LLMs can yield promising performance that is competitive with human annotations.
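To make the setup concrete, the sketch below illustrates one common way an LLM can be used as a relevance judge: prompt it with a query–document pair, parse a yes/no answer, and measure agreement with human labels (here via Cohen's kappa). This is only an assumed illustration, not the paper's actual prompt, models, or evaluation protocol; `call_llm` is a hypothetical placeholder for whatever chat-completion client is used.

```python
# Hypothetical sketch of LLM-based relevance judging, assuming binary
# relevance and a caller-supplied `call_llm(prompt) -> str` function.
from sklearn.metrics import cohen_kappa_score

PROMPT = (
    "Query: {query}\n"
    "Document: {document}\n"
    "Is the document relevant to the query? Answer only 'yes' or 'no'."
)

def judge_relevance(call_llm, query: str, document: str) -> int:
    """Return 1 if the LLM judges the document relevant, else 0."""
    answer = call_llm(PROMPT.format(query=query, document=document))
    return 1 if answer.strip().lower().startswith("yes") else 0

def agreement_with_humans(call_llm, pairs, human_labels):
    """Cohen's kappa between LLM judgments and human relevance labels."""
    llm_labels = [judge_relevance(call_llm, q, d) for q, d in pairs]
    return cohen_kappa_score(human_labels, llm_labels)

if __name__ == "__main__":
    # Stub model client used only to make the sketch runnable.
    fake_llm = lambda prompt: "yes" if "Brasília" in prompt else "no"
    pairs = [
        ("capital of Brazil", "Brasília is the capital of Brazil."),
        ("capital of Brazil", "Lisbon is the capital of Portugal."),
    ]
    print(agreement_with_humans(fake_llm, pairs, human_labels=[1, 0]))
```

In practice, the same loop can be run over an existing test collection's qrels, so that agreement statistics quantify how far LLM judgments deviate from the human gold standard.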