Abstract: Relevance judgments for information retrieval (IR) evaluation, once the domain of human assessors, are now often produced by Large Language Models (LLMs). While some studies report alignment between LLM and human judgments, claims that LLMs can replace human judges raise concerns about reliability, validity, and long-term impact. As IR systems increasingly rely on LLM-generated signals, evaluation risks becoming self-reinforcing, leading to potentially misleading conclusions. This paper examines scenarios in which LLM evaluators may falsely indicate success, particularly when LLM-based judgments influence both system development and evaluation. We highlight key risks, including bias reinforcement, reproducibility challenges, and inconsistencies in assessment methodologies. To address these concerns, we propose tests to quantify adverse effects, guardrails, and a collaborative framework for constructing reusable test collections that integrate LLM judgments responsibly. By providing perspectives from academia and industry, this work aims to establish best practices for the principled use of LLMs in IR evaluation.