A Glitch of Large Language Model in Reviewing Academic Papers

ACL ARR 2025 February Submission2280 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have achieved great success across various areas. However, it remains an open question whether they are suitable for academic paper reviewing. We systematically examine whether LLMs can serve as paper reviewers through an empirical study on papers from the International Conference on Learning Representations (ICLR), analyzing their reviewing patterns and identifying key limitations. We find that general-purpose LLMs struggle to generate well-structured reviews. However, when techniques such as Chain-of-Thought Prompting and Retrieval-Augmented Generation are applied, LLMs demonstrate enhanced critical reasoning, improving their review quality. Additionally, supervised fine-tuning further refines their judgment, enabling more consistent acceptance or rejection decisions. While challenges remain, our results suggest that LLMs can serve as auxiliary reviewers.
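As a rough illustration of the Chain-of-Thought prompting setup described in the abstract, the sketch below shows how a reviewing prompt might ask the model to reason through strengths and weaknesses before committing to a decision. This is not the authors' implementation; `call_llm` is a hypothetical placeholder for whatever chat-completion client is used, and the prompt wording is assumed.

```python
# Minimal sketch (assumptions, not the paper's code): chain-of-thought style
# prompting for paper review. `call_llm` is a hypothetical stand-in for any
# LLM chat-completion client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client of choice.")

REVIEW_PROMPT = """You are reviewing a paper submitted to ICLR.
Paper text:
{paper}

Think step by step before deciding:
1. Summarize the main contribution in two sentences.
2. List the strengths, citing specific sections.
3. List the weaknesses, citing specific sections.
4. Only then give a recommendation (accept / reject) with a score from 1 to 10.
"""

def review_paper(paper_text: str) -> str:
    """Generate a structured review using a chain-of-thought style prompt."""
    return call_llm(REVIEW_PROMPT.format(paper=paper_text))
```

The key design choice in such a prompt is forcing the model to enumerate evidence before issuing a verdict, which is the mechanism the abstract credits for more structured, better-reasoned reviews.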
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: Retrieval-Augmented Generation, Chain of Thought, Supervised Fine-Tuning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2280