A Glitch of Large Language Model in Reviewing Academic Papers

ACL ARR 2025 February Submission2280 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have achieved great success across various areas. However, it remains an open question whether they are suitable for academic paper reviewing. We systematically examine whether LLMs can serve as paper reviewers through an empirical study on papers from the International Conference on Learning Representations (ICLR), analyzing their reviewing patterns and identifying key limitations. We find that general-purpose LLMs struggle to generate well-structured reviews. However, when techniques such as Chain-of-Thought Prompting and Retrieval-Augmented Generation are applied, LLMs demonstrate enhanced critical reasoning, improving their review quality. Additionally, supervised fine-tuning further refines their judgment, enabling more consistent acceptance or rejection decisions. While challenges remain, our results suggest that LLMs can serve as auxiliary reviewers.
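As a rough illustration of the Chain-of-Thought prompting setup described in the abstract, the sketch below shows how a reviewing prompt might ask the model to reason through strengths and weaknesses before committing to a decision. This is not the authors' implementation; `call_llm` is a hypothetical placeholder for whatever chat-completion client is used, and the prompt wording is assumed.

```python
# Minimal sketch (assumptions, not the paper's code): chain-of-thought style
# prompting for paper review. `call_llm` is a hypothetical stand-in for any
# LLM chat-completion client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client of choice.")

REVIEW_PROMPT = """You are reviewing a paper submitted to ICLR.
Paper text:
{paper}

Think step by step before deciding:
1. Summarize the main contribution in two sentences.
2. List the strengths, citing specific sections.
3. List the weaknesses, citing specific sections.
4. Only then give a recommendation (accept / reject) with a score from 1 to 10.
"""

def review_paper(paper_text: str) -> str:
    """Generate a structured review using a chain-of-thought style prompt."""
    return call_llm(REVIEW_PROMPT.format(paper=paper_text))
```

The key design choice in such a prompt is forcing the model to enumerate evidence before issuing a verdict, which is the mechanism the abstract credits for more structured, better-reasoned reviews.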
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: Retrieval-Augmented Generation, Chain of Thought, Supervised Fine-Tuning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2280