Keywords: LLM, Paper revision, Benchmark
Abstract: The rise of human-AI collaboration can speed up the research process for experts and enable anyone with critical thinking skills to conduct innovative work. A key part of this collaboration is the AI's ability to improve a paper in response to human feedback, updating both the text and the experiments to meet high standards. To evaluate this skill, we introduce ReviseBench, a benchmark built on real academic data that tests Large Language Models (LLMs) on paper interpretation, experimental implementation, and paper formulation, with authors' camera-ready versions naturally serving as human baselines for comparison. To facilitate fine-grained assessment, we further propose ReviseArena, a platform supporting pairwise comparisons between different AI-revised papers. Our initial evaluation on ReviseBench reveals that even state-of-the-art foundation LLMs struggle significantly in this domain, achieving a win rate of less than 10% against human experts and exhibiting issues such as incremental revision, unprofessional revision, and potential data fabrication.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: automatic evaluation, LLM/AI agents
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5388