Abstract: The rapid increase in paper submissions to top AI and ML venues in recent years, in tandem with the development of ever more capable LLMs, has fueled a surge of interest in leveraging these models to automate parts of the peer review process. A core component of the reviewer's task consists of providing specific critiques that directly assess the scientific claims a paper makes. While it is now relatively easy to automatically generate passable (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging---requiring expert-level domain knowledge, careful reading, and logical reasoning. Furthermore, resources supporting this goal are lacking. To remedy this, and to facilitate benchmarking of LLMs on these objectives, this paper introduces CLAIMCHECK, a dataset of NeurIPS 2023 and 2024 submissions and reviews, annotated by ML experts for weaknesses and the paper claims that they target. We benchmark GPT-4o on three claim-centric tasks supported by CLAIMCHECK and find that even this cutting-edge model exhibits significant weaknesses in these tasks.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Language Modeling, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2553