SciReview: Diagnosing Compositional Scientific Reasoning in Frontier Models

Published: 26 May 2026, Last Modified: 26 May 2026
Venue: ICML 2026 FoGen Workshop Poster
License: CC BY 4.0
Keywords: memorization, generalization, deep generative models, autoregressive models, evaluation, benchmark, diagnostic framework, compositional reasoning, pattern matching, template matching, error detection
Abstract: Frontier large language models achieve high performance on many scientific evaluations, yet it remains unclear whether this performance reflects compositional reasoning or reliance on the conditional distribution of expert-genre prose. We introduce SciReview, a benchmark of expert-authored, research-grade passages with naturally injected errors that are locally plausible but scientifically consequential, spanning science, engineering, technology, and mathematics. The construction protocol explicitly excludes bare lookup errors, definition rewrites that preserve internal coherence, and unsupported assertions. By design, this filter leaves only errors whose detection requires re-deriving claims from local content rather than retrieving facts. We pair this with an adversarial difficulty calibration in which each task contains errors stratified by the agreement of three frontier models, yielding a per-item probe of where the memorization–generalization boundary sits. Evaluating GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 under five recall metrics crossed with two false-positive treatments, we find that rankings invert sharply once false-positive control is enforced (GPT-5.4 falls from 62% to 10% Average Recall). This suggests that high recall in the unpenalized regime can mask reliance on surface pattern-matching rather than disciplined re-derivation. We plan to release the benchmark upon acceptance.
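The abstract does not define its recall metrics or false-positive treatments, so the following is only an illustrative sketch of why rankings can invert once false positives are penalized, and of how errors might be stratified by model agreement. The function names, the rule that each spurious flag cancels one hit, and the toy data are all assumptions made for illustration, not the benchmark's actual scoring.

```python
from collections import defaultdict


def recall(flagged, injected):
    """Fraction of injected errors that the model flagged (no FP penalty)."""
    if not injected:
        return 0.0
    return len(flagged & injected) / len(injected)


def penalized_recall(flagged, injected):
    """Hypothetical FP treatment (an assumption, not the paper's metric):
    each spurious flag cancels one correct detection, floored at zero."""
    if not injected:
        return 0.0
    hits = len(flagged & injected)
    false_positives = len(flagged - injected)
    return max(0, hits - false_positives) / len(injected)


def agreement_strata(detections, injected):
    """Group injected errors by how many models detected them."""
    strata = defaultdict(list)
    for err in sorted(injected):
        n_models = sum(err in flagged for flagged in detections.values())
        strata[n_models].append(err)
    return dict(strata)


# Toy data: hypothetical error IDs ("e*") and spurious flags ("x*").
injected = {"e1", "e2", "e3", "e4"}
detections = {
    "model_a": {"e1", "e2", "e3", "x1", "x2"},  # flags liberally
    "model_b": {"e1", "e2"},                    # flags conservatively
    "model_c": {"e1"},
}

scores = {
    name: (recall(flagged, injected), penalized_recall(flagged, injected))
    for name, flagged in detections.items()
}
strata = agreement_strata(detections, injected)
```

In this toy setup, `model_a` leads on unpenalized recall (0.75 versus 0.5 for `model_b`) but falls behind once its two spurious flags are counted (0.25 versus 0.5), which mirrors the kind of ranking inversion the abstract reports. The `strata` mapping places each injected error in a bucket by the number of models that caught it, from 0 (missed by all) to 3 (caught by all).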
Submission Number: 108