Abstract: Automatic model assessment has long been a critical challenge in machine learning. Traditional methods, typically based on matching or small models, often fall short in open-ended and dynamic scenarios. Recent advances in Large Language Models (LLMs) have inspired the ``LLM-as-a-judge'' paradigm, in which LLMs are leveraged to perform scoring, ranking, or selection across a variety of machine learning evaluation scenarios. This paper presents a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview of this evolving field. We first define LLM-as-a-judge from both the input and output perspectives. We then introduce a systematic taxonomy that explores LLM-as-a-judge along three dimensions: \textit{what} to judge, \textit{how} to judge, and \textit{how} to benchmark. Finally, we highlight key challenges and promising future directions for this emerging area. We have released and will maintain a paper list on \textbf{LLM-as-a-judge} at: \url{https://anonymous.4open.science/r/Awesome-LLM-as-a-judge-266D}.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, LLM-as-a-judge, Model Evaluation
Contribution Types: Model analysis & interpretability, Surveys
Languages Studied: English
Submission Number: 6253