Automatic Evaluating Scientific Reviews Through Meta-reviewer's Lens: A Reliable Benchmark for Peer Review Generation

Published: 2025 · Last Modified: 22 Jan 2026 · NLPCC (2) 2025 · License: CC BY-SA 4.0
Abstract: Peer review is essential for assessing scientific manuscripts, but the growing volume of submissions strains its efficiency, creating demand for reliable automatic metrics that evaluate review quality. The most widely used word-overlap and embedding-based metrics, which rely on a single peer review as the reference, show significant discrepancies with human judgments. To address this, we propose ReviewScore, a novel automatic metric that evaluates peer review quality by analyzing atomic review opinions (AROs) with large language models. In addition, we construct ReviewEval, a high-quality, human-annotated benchmark for reliably measuring both ReviewScore and peer review generation models. The reference reviews in ReviewEval are crafted by integrating valuable opinions from multiple reviewers, guided by the meta-review, making them more holistic and objective. Experimental results show that ReviewScore aligns with human judgments more closely than existing metrics. Using ReviewEval, we also comprehensively re-evaluate peer review generation models and conduct a detailed analysis, revealing several key insights. The ReviewEval benchmark and the ReviewScore toolkit will be publicly released.
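To make the ARO-based scoring idea concrete, the following is a minimal illustrative sketch, not the authors' released toolkit: it segments a generated review into candidate atomic opinions and averages per-opinion judgments from a pluggable LLM judge. The segmentation heuristic, function names, prompt wording, and 0-to-1 scoring scale are all assumptions made for illustration.

```python
import re
from typing import Callable, List


def split_into_aros(review: str) -> List[str]:
    """Naively split a review into candidate atomic review opinions (AROs).

    A real system would likely use an LLM or a trained segmenter; sentence
    splitting is used here only as a placeholder.
    """
    sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    return [s for s in sentences if s]


def review_score(
    generated_review: str,
    reference_review: str,
    llm_judge: Callable[[str], float],
) -> float:
    """Score a generated review by averaging per-ARO judgments.

    `llm_judge` is any callable that maps a prompt string to a score in
    [0, 1], e.g. a wrapper around an LLM API (not specified here).
    """
    aros = split_into_aros(generated_review)
    if not aros:
        return 0.0
    scores = []
    for aro in aros:
        prompt = (
            "Reference review (aggregated from multiple reviewers):\n"
            f"{reference_review}\n\n"
            f"Candidate opinion: {aro}\n"
            "On a scale from 0 to 1, how well is this opinion supported "
            "by the reference review? Answer with a single number."
        )
        scores.append(llm_judge(prompt))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Dummy judge so the sketch runs self-contained; in practice this would
    # call a large language model.
    dummy_judge = lambda prompt: 0.5
    print(review_score(
        "The method is novel. Experiments lack baselines.",
        "Reviewers agreed the idea is novel but noted missing baselines.",
        dummy_judge,
    ))
```

The key design point the sketch mirrors is that the reference is a single, meta-review-guided aggregation of multiple reviewers' opinions, so each atomic opinion in the candidate review is judged against a more holistic target than any one original review.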