The Need for a Leaderboard: A Survey of LLM as a Judge in NLP

ACL ARR 2024 June Submission 2693 Authors

15 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Recently, the use of large language models (LLMs) as judges has gained popularity in Natural Language Processing (NLP) research. This paper reviews recent studies on LLM-as-a-judge, which show significant effort devoted to developing methods for LLM-based assessment. However, there is no common standard for meta-evaluation, and several potential risks associated with LLMs need to be acknowledged. We therefore recommend creating a leaderboard and offer a draft proposal to support the development and adoption of LLM-as-a-judge.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, evaluation
Contribution Types: Surveys
Languages Studied: English, German, Russian, Chinese
Submission Number: 2693