Abstract: The rapid increase in paper submissions to top AI and ML venues in recent years, in tandem with the development of ever more capable LLMs, has fueled a surge of interest in leveraging these models to automate parts of the peer review process. A core component of the reviewer's task consists of providing specific critiques that directly assess the scientific claims a paper makes. While it is now relatively easy to automatically generate passable (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging---requiring expert-level domain knowledge, careful reading, and logical reasoning. Furthermore, resources supporting this goal are lacking. To remedy this, and to facilitate benchmarking of LLMs on these objectives, this paper introduces CLAIMCHECK, a dataset of NeurIPS 2023 and 2024 submissions and reviews, annotated by ML experts for weaknesses and the paper claims that they target. We benchmark GPT-4o on three claim-centric tasks supported by CLAIMCHECK and find that even this cutting-edge model exhibits significant weaknesses in these tasks.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Language Modeling, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2553