Abstract: Peer review is essential for scientific progress, but it faces challenges such as reviewer shortages and growing workloads. Although Large Language Models (LLMs) show potential for providing assistance, research has reported significant limitations in the reviews they generate. While these insights are valuable, such analyses are difficult to conduct because they require considerable time and effort, especially given the rapid pace of LLM development. To address this challenge, we developed an automatic evaluation pipeline that assesses LLMs' paper review capability by comparing their reviews with expert-written ones. Using a dataset of 676 OpenReview papers, we examined the agreement between LLMs and experts in identifying strengths and weaknesses. The results show that LLMs lack balanced perspectives, significantly overlook novelty assessment when criticizing, and make poor acceptance decisions. Our automated pipeline enables scalable evaluation of LLMs' paper review capability over time.
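To illustrate the kind of LLM-expert agreement comparison the abstract describes, below is a minimal sketch that matches strength/weakness points from an LLM review against points from an expert review. The matching rule (token-level Jaccard similarity with a fixed threshold), the threshold value, and all function names are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of the agreement-measurement step: given lists of
# strength/weakness points extracted from an LLM review and an expert review,
# estimate how many points on each side are matched by the other side.
from __future__ import annotations
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens of a review point."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def agreement(llm_points: list[str], expert_points: list[str],
              threshold: float = 0.4) -> dict[str, float]:
    """Fraction of LLM points matching some expert point, and vice versa."""
    def covered(src: list[str], ref: list[str]) -> float:
        hits = sum(
            any(jaccard(tokens(s), tokens(r)) >= threshold for r in ref)
            for s in src
        )
        return hits / len(src) if src else 0.0

    return {
        "precision": covered(llm_points, expert_points),  # LLM points confirmed by experts
        "recall": covered(expert_points, llm_points),     # expert points recovered by the LLM
    }


if __name__ == "__main__":
    llm = ["The paper lacks novelty compared to prior work",
           "Experiments cover only English data"]
    expert = ["Evaluation is limited to English",
              "Writing is clear and well organized"]
    print(agreement(llm, expert))
```

In practice, a pipeline like the one described would likely use a stronger semantic matcher (e.g., embedding similarity or an LLM judge) rather than lexical overlap; the sketch only shows the shape of the precision/recall-style agreement computation.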
Dataset: https://figshare.com/s/d5adf26c802527dd0f62
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Human-Centered NLP, Ethics, Bias, and Fairness, Sentiment Analysis
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6945