Keywords: multimodal, benchmark, LLM, peer review automation
Abstract: With the rapid growth of academic publications, peer review has become an essential yet increasingly time-consuming part of the research process. While Large Language Models (LLMs) are increasingly adopted to assist in generating review comments, current evaluations lack a unified benchmark for rigorously assessing their ability to produce comprehensive, accurate, and human-aligned assessments, especially for multimodal content such as figures and tables. To address this gap, we propose **MMReview**, a multidisciplinary and multimodal benchmark encompassing 240 papers from 17 research domains across four major disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design 13 tasks grouped into four core categories, which evaluate LLMs and Multimodal LLMs (MLLMs) on stepwise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial inputs. Extensive experiments on 18 open-source and 3 closed-source models validate the benchmark's comprehensiveness. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 5447