REAL-JUDGE: When Should We Trust LLM Judges? An Exploration of Uncertainty Measurement

ACL ARR 2025 February Submission 4443 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

The widespread use of LLM-as-a-judge raises concerns about its reliability and limitations in real-world applications. Existing studies have explored LLM-as-a-judge in both subjective and objective scenarios, but challenges remain due to limited benchmark diversity, inherent data biases (e.g., length and style biases), and the lack of metrics that assess whether LLM judges truly understand the boundaries of their own judgments. To address these shortcomings, we propose REAL-JUDGE, a benchmark of 1,280 samples spanning 7 task types, specifically designed to minimize common evaluation biases. We also adopt more comprehensive evaluation methods that enable us to effectively assess the calibration of LLM judges. Our results reveal that even state-of-the-art models exhibit poor calibration and that different types of LLM judges excel in distinct task categories, underscoring the need for context-specific model selection. In conclusion, we provide a bias-free dataset and a reliable method for evaluating LLM-as-a-judge.
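
The abstract does not specify which calibration metric REAL-JUDGE adopts. As an illustration only, the sketch below shows a standard expected calibration error (ECE) computation over hypothetical judge confidence scores and correctness labels; the function name, inputs, and example data are assumptions, not the paper's method.

```python
# Illustrative sketch, not the paper's implementation: standard expected
# calibration error (ECE) over judge confidences and correctness labels.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in [0, n_bins - 1].
    bin_ids = np.clip(np.digitize(confidences, bin_edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # mean stated confidence in this bin
        accuracy = correct[mask].mean()       # fraction of correct judgments in this bin
        ece += mask.mean() * abs(accuracy - avg_conf)
    return ece

# Hypothetical example: confidences reported by an LLM judge and whether
# its verdicts agreed with the gold labels.
judge_conf = [0.95, 0.80, 0.99, 0.60, 0.70, 0.90]
judge_right = [1, 1, 0, 1, 0, 1]
print(f"ECE = {expected_calibration_error(judge_conf, judge_right):.3f}")
```

A well-calibrated judge yields a low ECE, meaning its stated confidence tracks how often its verdicts are actually correct.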

Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: Chinese, English
Submission Number: 4443