The widespread use of LLM-as-a-judge raises concerns about its reliability and limitations in real-world applications. Existing studies have explored LLM-as-a-judge in both subjective and objective scenarios, but challenges remain due to limited benchmark diversity, inherent data biases (e.g., length and style biases), and the lack of metrics that assess whether LLM-as-a-judge models truly understand their own judgement boundaries. To address these shortcomings, we propose REAL-JUDGE, a benchmark of 1,280 samples spanning 7 task types, specifically designed to minimize common evaluation biases. We also adopt more comprehensive evaluation methods that allow us to effectively assess the calibration of LLM-as-a-judge. Our results reveal that even state-of-the-art models exhibit poor calibration and that different types of LLM-as-a-judge excel in distinct task categories, underscoring the need for context-specific model selection. In conclusion, we provide a bias-free dataset and a reliable method for evaluating LLM-as-a-judge.
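The abstract does not specify which calibration metric is used, so the following is only an illustrative sketch: it computes Expected Calibration Error (ECE), a standard measure of the gap between a judge's stated confidence and its empirical accuracy. The function name, arguments, and binning scheme are our own placeholders, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error over equal-width confidence bins.

    confidences: judge's self-reported confidence values in [0, 1]
    correct:     0/1 flags, 1 if the judgement matched the reference label
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # Weighted gap between average accuracy and average confidence in this bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

# Toy usage: an over-confident judge (high confidence, frequent errors) yields a large ECE.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 0, 1]))
```

A well-calibrated judge would produce an ECE close to zero, i.e., its confidence scores would track how often its judgements are actually correct.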