Context Example Selection for LLM Generated Relevance Assessments

Published: 2025 · Last Modified: 22 Jan 2026 · ECIR (1) 2025 · CC BY-SA 4.0
Abstract: Evaluating information retrieval systems typically requires large test collections with human-annotated relevance judgements. However, labelling large collections for relevance is an expensive and time-consuming process. Recent work has explored the possibility of using large language models (LLMs) to generate relevance labels. Typically, such works deploy zero-shot prompting or in-context learning (ICL) strategies. However, ICL requires careful selection of representative examples to demonstrate the judging task to the LLM. In this work, we investigate strategies for selecting informative examples to include as context to help an LLM generate discriminative relevance assessments. Our experiments using the TREC DL 2019 and 2020 test collections show that including informative examples as context can lead to notable improvements in the quality of generated relevance assessments. In particular, we found that our same-query, random-document example selection strategy resulted in +37% \(\kappa\) agreement between LLM-generated and human-labelled relevance assessments, and +17% \(\tau\) agreement in the subsequent ranking of systems by the generated qrels, compared to a zero-shot baseline strategy. Such improvements in LLM-generated qrels can help to reduce the amount of human relevance assessment that is required to create test collections and thus facilitate quicker and easier information retrieval research.
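To make the example selection idea concrete, the sketch below illustrates one way a "same-query, random-document" ICL strategy could be implemented: for a target query, previously judged documents for that same query are sampled at random and prepended to the judging prompt. This is a minimal illustration only; the pool structure, label scale, function names, and prompt wording are assumptions and are not taken from the paper.

```python
import random

# Hypothetical judged pool: query_id -> list of (document_text, human_label) pairs.
# In the paper's setting these would come from existing TREC DL judgements;
# toy data is used here purely for illustration.
JUDGED_POOL = {
    "q1": [
        ("Document about deep learning ranking models.", 3),
        ("Document about cooking recipes.", 0),
        ("Document about neural retrieval methods.", 2),
    ],
}

def select_same_query_random_examples(query_id, k=2, rng=random):
    """Sketch of a same-query, random-document strategy: sample k previously
    judged documents for the *same* query to serve as in-context examples."""
    pool = JUDGED_POOL.get(query_id, [])
    return rng.sample(pool, min(k, len(pool)))

def build_prompt(query_text, target_doc, examples):
    """Assemble a relevance-judging prompt with the selected examples as context
    (assumed 0-3 graded scale, as used in TREC DL)."""
    lines = ["Judge the relevance of the document to the query on a 0-3 scale.", ""]
    for doc, label in examples:
        lines += [f"Query: {query_text}", f"Document: {doc}", f"Relevance: {label}", ""]
    lines += [f"Query: {query_text}", f"Document: {target_doc}", "Relevance:"]
    return "\n".join(lines)

if __name__ == "__main__":
    examples = select_same_query_random_examples("q1", k=2)
    prompt = build_prompt(
        "what is neural ranking",
        "A survey of learned sparse retrieval.",
        examples,
    )
    print(prompt)  # this prompt would then be sent to an LLM to obtain a label
```

The resulting prompt would be passed to an LLM, and the generated labels could then be compared against human judgements (e.g. via \(\kappa\)) or used as qrels for ranking systems (compared via \(\tau\)), as described in the abstract.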