Keywords: memorization, generalization, deep generative models, autoregressive models, evaluation, benchmark, diagnostic framework, compositional reasoning, pattern matching, template matching, error detection
Abstract: Frontier large language models achieve high performance on many scientific evaluations, yet it
remains unclear whether such performance reflects compositional reasoning or
reliance on the conditional distribution of expert-genre prose. We introduce
SciReview, a benchmark of expert-authored research-grade passages
with naturally injected, locally plausible but scientifically consequential
errors across science, engineering, technology, and mathematics. The
construction protocol explicitly excludes bare lookup errors, definition
rewrites that preserve internal coherence, and unsupported assertions. By
design, this filter leaves only errors whose detection requires
re-deriving claims from local content rather than retrieving facts. We
pair this with an adversarial difficulty calibration in which each task
contains errors stratified by the agreement of three frontier models,
yielding a per-item probe of where the memorization--generalization
boundary sits. Evaluating GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6
under five recall metrics crossed with two false-positive treatments, we
find that rankings invert sharply once false-positive control is enforced
(GPT-5.4 falls from 62% to 10% Average Recall). This suggests that high recall in the unpenalized regime can
mask reliance on surface pattern-matching rather than disciplined
re-derivation. We plan to release the benchmark upon acceptance.
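
To illustrate how enforcing false-positive control can collapse an otherwise high recall score, here is a minimal Python sketch. The abstract does not define its five recall metrics or its two false-positive treatments, so everything below is an assumption made for illustration: the function names, the budget-style penalty, and the toy line numbers are hypothetical and are not the SciReview scoring rules.

```python
def recall(flagged, true_errors):
    """Fraction of injected errors that the model flagged."""
    if not true_errors:
        return 0.0
    return len(flagged & true_errors) / len(true_errors)


def penalized_recall(flagged, true_errors, max_false_positives=0):
    """Hypothetical false-positive treatment (an assumption, not the
    paper's definition): the score is zeroed whenever the number of
    spurious flags exceeds a fixed budget."""
    false_positives = len(flagged - true_errors)
    if false_positives > max_false_positives:
        return 0.0
    return recall(flagged, true_errors)


# Toy passage: injected errors sit on lines 3, 7, 12 and 18 (made-up data).
true_errors = {3, 7, 12, 18}
# A model that flags liberally: three hits and three spurious flags.
flagged = {1, 3, 5, 7, 9, 12}

unpenalized = recall(flagged, true_errors)
penalized = penalized_recall(flagged, true_errors, max_false_positives=1)
print(unpenalized, penalized)
```

In this toy case the liberal flagger finds three of the four injected errors but also raises three spurious flags, so it scores well without a penalty and drops to zero once the budget of one false positive is enforced. A softer treatment, such as subtracting a per-flag cost, would be an equally plausible reading of "false-positive control".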
Submission Number: 108