Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination

Published: 23 Sept 2025 · Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: Reasoning-Driven Synthesis, Benchmark Contamination
TL;DR: We empirically examine LLM benchmark contamination by synthesizing multi-step reasoning questions directly from temporally stratified research papers, finding stable performance across knowledge cutoffs, unlike benchmarks built from publicly sourced questions.
Abstract: Capability evaluation of large language models (LLMs) is increasingly shadowed by rising concerns about data contamination, which cast doubt on whether static benchmarks measure genuine reasoning or mere memorization. We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers, harnessing the natural temporal structure of research publications, where performance decay after knowledge cutoffs may indicate potential contamination. We evaluated 4 frontier models, represented by 2 models with different knowledge cutoff dates per family, on 1,643 multi-step reasoning questions synthesized from 20,277 arXiv papers stratified over 26 months, covering at least 6 months before and after all cutoff dates. Our results consistently showed no significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. Through comparative analysis with other longitudinal studies that used publicly sourced questions and observed significant post-cutoff performance decay, we hypothesize that the multi-step reasoning required by our synthesis pipeline goes deeper than the shallow pattern matching that enables memorization, which effectively serves as a mitigation strategy against benchmark contamination. We plan to fully open-source our code and datasets after peer review to aid reproducibility and promote the adoption of contamination-resistant evaluation paradigms.
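To make the temporal-stratification idea concrete, below is a minimal Python sketch of a pre- vs. post-cutoff decay check under stated assumptions; it is not the authors' released pipeline. The `EvalRecord` type, the `cutoff_decay_check` helper, and the use of a two-proportion z-test are illustrative choices introduced here, not details taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt
from statistics import NormalDist

@dataclass
class EvalRecord:
    paper_month: date  # publication month of the source arXiv paper
    correct: bool      # whether the model answered the synthesized question correctly

def cutoff_decay_check(records, cutoff, months_window=6):
    """Compare accuracy on questions from papers published before vs. after a
    model's knowledge cutoff, within a symmetric window of `months_window` months.
    A significant post-cutoff drop would be consistent with contamination
    inflating pre-cutoff accuracy."""
    def months_from_cutoff(d):
        return (d.year - cutoff.year) * 12 + (d.month - cutoff.month)

    # Bucket question-level outcomes by whether the source paper precedes the cutoff.
    pre = [r.correct for r in records if -months_window <= months_from_cutoff(r.paper_month) < 0]
    post = [r.correct for r in records if 0 <= months_from_cutoff(r.paper_month) < months_window]
    if not pre or not post:
        raise ValueError("need evaluation records on both sides of the cutoff")

    acc_pre = sum(pre) / len(pre)
    acc_post = sum(post) / len(post)

    # One-sided two-proportion z-test for a drop in accuracy after the cutoff.
    pooled = (sum(pre) + sum(post)) / (len(pre) + len(post))
    se = sqrt(pooled * (1 - pooled) * (1 / len(pre) + 1 / len(post)))
    z = (acc_pre - acc_post) / se
    p_one_sided = 1 - NormalDist().cdf(z)
    return acc_pre, acc_post, z, p_one_sided

# Example usage with toy data (one record per model-question pair):
records = [
    EvalRecord(date(2024, 3, 1), True),
    EvalRecord(date(2024, 4, 1), True),
    EvalRecord(date(2024, 9, 1), False),
    EvalRecord(date(2024, 10, 1), True),
]
print(cutoff_decay_check(records, cutoff=date(2024, 6, 1)))
```

In this sketch, a small p-value flags a post-cutoff accuracy drop; the paper's finding of stable performance would correspond to non-significant results across models and cutoff dates.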
Submission Number: 125