Keywords: multimodal, benchmark, LLM, peer review automation
Abstract: With the rapid growth of academic publications, peer review has become an essential yet increasingly time-consuming part of the research process. While Large Language Models (LLMs) are increasingly adopted to assist in generating review comments, current evaluations lack a unified benchmark for rigorously assessing their ability to produce comprehensive, accurate, and human-aligned assessments, especially for multimodal content such as figures and tables. To address this gap, we propose **MMReview**, a multidisciplinary and multimodal benchmark encompassing 240 papers from 17 research domains across four major disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design 13 tasks grouped into four core categories, which evaluate LLMs and Multimodal LLMs (MLLMs) on stepwise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial inputs. Extensive experiments on 18 open-source and 3 closed-source models validate the benchmark's comprehensiveness. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 5447