Keywords: LLM training, open-weight, open-source, contamination, evaluation, academic pretraining
Abstract: Standardized benchmarks have become the dominant measure of progress in large language models, yet their validity is increasingly compromised by data contamination and by the unclear relationship between benchmark scores and genuine language understanding.
We introduce Gaperon, a suite of fully open bilingual (French-English) language models designed as an experimental testbed to investigate evaluation dynamics under realistic training conditions. Our study makes three core contributions.
First, we demonstrate mismatches between benchmark performance and generation quality: models that excel on benchmarks may underperform in qualitative text generation, and vice versa. Second, through our deliberately contaminated Gaperon-Garlic variant, we show that competitive benchmark scores can be recovered via late-stage contamination with only moderate degradation of generation quality; surprisingly, such contamination also improves performance on held-out benchmarks.
Third, we provide empirical evidence that widely used neural quality filters, particularly those trained to favor instructional or educational content, amplify benchmark contamination in pretraining corpora: the DCLM classifier systematically ranks benchmark samples within the top 5 percentiles of corpus documents.
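As a minimal sketch of how such a measurement can be run (not the paper's released code), one can score a corpus sample and a set of benchmark samples with a fastText-format quality classifier and compute each benchmark sample's percentile rank in the corpus score distribution. The model path, the positive label name `__label__hq`, and the placeholder texts below are all assumptions for illustration.

```python
# Sketch: check where benchmark samples land in a quality filter's ranking.
import fasttext
import numpy as np

# Hypothetical local path to a DCLM-style fastText quality classifier.
model = fasttext.load_model("dclm_quality_classifier.bin")

def quality_score(text: str) -> float:
    """Probability the classifier assigns to the 'high quality' label."""
    # fastText predicts one line at a time, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    # Assumption: the positive label is named "__label__hq".
    return float(probs[0]) if labels[0] == "__label__hq" else 1.0 - float(probs[0])

corpus_texts = ["...web document 1...", "...web document 2..."]  # placeholder corpus
benchmark_samples = ["...benchmark question + answer..."]        # placeholder samples

corpus_scores = np.sort([quality_score(d) for d in corpus_texts])
bench_scores = [quality_score(s) for s in benchmark_samples]

# Percentile rank >= 95 means the filter places the sample in the top 5
# percentiles of corpus documents, i.e. it would survive aggressive filtering.
ranks = 100.0 * np.searchsorted(corpus_scores, bench_scores) / len(corpus_scores)
print(f"median benchmark percentile: {np.median(ranks):.1f}")
```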
We release all models, data mixtures, checkpoints, and evaluation code to support reproducibility and further investigation.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: LLM training, open-source, open-weight, French, English, coding, contamination, evaluation, data filtering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: French, English
Submission Number: 6931