Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: scientific reasoning, benchmarks, peer review, error detection, research reliability, scientific verification, evaluation methodology
Abstract: Large language models are increasingly positioned as AI scientists, yet existing evaluations focus on hypothesis generation, coding, experimentation, or reproducibility rather than on a capability that is central to scientific reliability: reviewer-style error detection, critique, and feedback.
We introduce SciReview, a benchmark that evaluates whether frontier models can read a realistic research writeup and identify locally plausible, high-consequence conceptual errors.
Each task begins from an expert-authored research text; a domain expert then injects a small set of natural errors and provides gold rationales, while the protocol explicitly excludes trivial lookup mistakes, unsupported falsehoods, and items that break internal coherence.
Errors are calibrated into baseline and challenging difficulty tiers via adversarial filtering against multiple frontier models.
We evaluate three frontier models (GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6) under five complementary scoring regimes and along three qualitative axes (helpfulness, correctness, and alignment).
No model achieves perfect error recovery on more than a single item. The model with the highest uncorrected recall (GPT-5.4, 62% average recall) collapses to 10% under the regime in which a single false positive disqualifies the run, while Gemini 3.1 Pro, which domain experts rated highest on overall quality, retains 45%.
SciReview complements recent AI-for-science benchmarks by measuring the capability that matters most for preventing researchers from acting on faulty premises before publication.
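The gap between uncorrected recall and recall under false-positive disqualification can be illustrated with a minimal Python sketch. The abstract does not define the five scoring regimes, so the rule below (a run scores zero as soon as it flags anything outside the gold set) is an assumption, and the function names and error identifiers are hypothetical.

```python
from typing import Set


def recall(flagged: Set[str], gold: Set[str]) -> float:
    """Fraction of injected (gold) errors that the model flagged."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(flagged & gold) / len(gold)


def strict_recall(flagged: Set[str], gold: Set[str]) -> float:
    """Recall that is zeroed when the run contains any false positive.

    Assumed rule: any flagged item outside the gold set disqualifies the run.
    """
    if flagged - gold:  # at least one flagged item is not an injected error
        return 0.0
    return recall(flagged, gold)


# Hypothetical item with four injected errors.
gold = {"e1", "e2", "e3", "e4"}
clean_run = {"e1", "e2"}              # fewer hits, no false positives
noisy_run = {"e1", "e2", "e3", "x9"}  # more hits, one false positive

print(recall(clean_run, gold), strict_recall(clean_run, gold))
print(recall(noisy_run, gold), strict_recall(noisy_run, gold))
```

Under this assumed rule, a model that flags aggressively can lead on uncorrected recall and still trail a more conservative model once false positives are penalized, which is the pattern the reported results describe.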
Submission Number: 76