Abstract: Relevance judgments have traditionally relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in their use as a proxy for human relevance judgments. In this setting, a key yet underexplored factor is the choice of relevance scale. Relevance scales range from binary to fine-grained, yet their impact on the effectiveness of LLM-based judgments, the effects of scale conversions, and their role in the presence of potential data contamination remain unclear.
External IDs: dblp:conf/ecir/ZamoloLSDMR26