ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

ACL ARR 2026 January Submission 3459 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Automated Peer Review, Multi-agent Systems, Large Language Models, Evaluation Benchmarks
Abstract: The rapid growth in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments that lack substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark that evaluates review text against paper-specific rubrics derived from official reviewing guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts through targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines built on substantially larger and stronger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across eight dimensions. Code is available at https://gitfront.io/r/anonymous-repo-acl/ecCqeCKyx8tM/ReviewGrounder-ACL-26-submission/.
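To make the draft-then-ground decomposition described in the abstract concrete, the following is a minimal Python sketch of such a two-stage pipeline. All function names, prompt wording, and the `drafter`/`grounder` callables are illustrative assumptions, not the authors' actual implementation or prompts.

```python
from typing import Callable, List

# Hypothetical sketch of a rubric-guided, two-stage reviewing pipeline.
# `drafter` and `grounder` stand in for any LLM call (e.g., a 14B drafter
# and a larger grounding model, as the abstract describes); both are
# assumptions supplied by the caller.

def draft_review(paper_text: str, rubric: List[str],
                 drafter: Callable[[str], str]) -> str:
    """Stage 1: a drafter model writes an initial review conditioned on
    paper-specific rubric items."""
    rubric_block = "\n".join(f"- {item}" for item in rubric)
    prompt = (
        "Review the following paper, addressing each rubric item with a "
        f"concrete comment.\n\nRubric:\n{rubric_block}\n\nPaper:\n{paper_text}"
    )
    return drafter(prompt)

def ground_review(draft: str, evidence: List[str],
                  grounder: Callable[[str], str]) -> str:
    """Stage 2: a grounding model enriches the shallow draft by
    consolidating retrieved evidence (paper excerpts, related work)."""
    evidence_block = "\n\n".join(evidence)
    prompt = (
        "Revise the draft review so that every claim is tied to specific "
        "evidence; drop comments the evidence cannot support.\n\n"
        f"Draft:\n{draft}\n\nEvidence:\n{evidence_block}"
    )
    return grounder(prompt)

def review_pipeline(paper_text: str, rubric: List[str],
                    evidence: List[str],
                    drafter: Callable[[str], str],
                    grounder: Callable[[str], str]) -> str:
    """Compose the two stages: draft against the rubric, then ground."""
    draft = draft_review(paper_text, rubric, drafter)
    return ground_review(draft, evidence, grounder)
```

In this sketch the division of labor mirrors the abstract's claim: the drafter only needs to produce rubric-covering comments, while the grounding stage carries the heavier evidence-consolidation work, which is why the two stages can use different backbone sizes.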
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Generation, Resources and Evaluation, Information Retrieval and Text Mining
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3459