Neither Valid Nor Reliable? Investigating the Use of LLMs as Judges

Published: 22 Sept 2025, Last Modified: 03 Jan 2026
Venue: WiML @ NeurIPS 2025
License: CC BY 4.0
Keywords: LLMs as judges, natural language generation evaluation, measurement theory
Submission Number: 406