ReviewArena: A Large-Scale Cross-Conference Dataset and Benchmark for LLM Peer Review

Published: 30 May 2026, Last Modified: 30 May 2026, ICML2026-AI4Science Spotlight, CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Peer review, Dataset, Benchmark, Large language models, LLM-as-reviewer, OpenReview
TL;DR: ReviewArena: 51,529 papers and 196k reviews from 7 venues, plus ReviewArena-Eval (1,002 papers) for LLM reviewer benchmarking. LLMs show miscalibration, score compression, and weak accept/reject discrimination.
Abstract: Peer review is central to quality control in machine learning (ML), but growing submission volumes have strained reviewer capacity and motivated interest in large language models (LLMs) as reviewers. Progress is hindered by the lack of datasets that pair full papers with structured, multi-dimensional reviews across venues and capture the full review–rebuttal–decision process. We introduce ReviewArena, a large-scale peer-review dataset constructed from all OpenReview venues with public reviews at the time of writing: NeurIPS, ICLR, ICML, CoRL, COLM, EMNLP, and TMLR. The dataset comprises 51,529 papers and 196,099 reviews across fourteen review fields, including full PDFs, reviewer scores and text, rebuttals, meta-reviews, and final decisions, with post-rebuttal revisions for NeurIPS 2025. To facilitate research, we derive ReviewArena-Eval, a 1,002-paper benchmark spanning the six conferences with aligned, venue-specific evaluation protocols. Baseline experiments with six open-weight LLMs using venue-aware prompts show that current models are miscalibrated, compress rating scales, and only weakly distinguish accepted from rejected papers, while review text quality is only weakly coupled to numeric accuracy. ReviewArena is a unified resource for studying automated peer review, enabling research on review generation, scoring, calibration, and decision-making.
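The three failure modes named in the abstract (miscalibration, rating-scale compression, weak accept/reject discrimination) could be quantified in several ways. The sketch below shows one plausible set of measures; it is an assumption for illustration and not necessarily the protocol used in ReviewArena-Eval. The function names are hypothetical, and the toy scores are made up and do not come from the dataset.

```python
from statistics import mean, pstdev

def calibration_bias(model_scores, human_scores):
    """Mean signed difference (model minus human); 0 means no systematic offset."""
    return mean(m - h for m, h in zip(model_scores, human_scores))

def compression_ratio(model_scores, human_scores):
    """Model score spread divided by human score spread; values below 1 suggest compression."""
    return pstdev(model_scores) / pstdev(human_scores)

def accept_reject_auroc(model_scores, accepted):
    """Chance that a random accepted paper outscores a random rejected one (ties count half)."""
    pos = [s for s, a in zip(model_scores, accepted) if a]
    neg = [s for s, a in zip(model_scores, accepted) if not a]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy, invented values on a shared 1-10 scale (not taken from ReviewArena).
human = [3, 8, 5, 2, 7, 6]
model = [6, 7, 6, 6, 7, 6]
accepted = [False, True, False, False, True, True]

print("bias:", calibration_bias(model, human))
print("compression ratio:", compression_ratio(model, human))
print("accept/reject AUROC:", accept_reject_auroc(model, accepted))
```

In this toy case the model scores cluster around 6 to 7 while the human scores range from 2 to 8, so the compression ratio falls well below 1 and the bias is positive. An AUROC near 0.5 would indicate that scores carry little information about the final decision.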
Submission Number: 284