Keywords: LLM, Paper revision, Benchmark
Abstract: The rise of human-AI collaboration can speed up the research process for experts and enable anyone with critical thinking skills to conduct innovative work. A key part of this collaboration is the AI's ability to improve a paper in response to human feedback, updating both the text and the experiments to meet high standards. To evaluate this skill, we introduce ReviseBench, a benchmark built on real academic data that tests Large Language Models (LLMs) on paper interpretation, experimental implementation, and paper formulation, with authors' camera-ready versions naturally serving as human baselines for comparison. To facilitate fine-grained assessment, we further propose ReviseArena, a platform supporting pairwise comparisons between different AI-revised papers. Our initial evaluation on ReviseBench reveals that even state-of-the-art foundation LLMs struggle significantly in this domain, achieving a win rate of less than 10% against human experts and exhibiting issues such as incremental revision, unprofessional revision, and potential data fabrication.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: automatic evaluation, LLM/AI agents
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5388