Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination’s Impact on Machine Translation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We train ~40 models at each of the 1B and 8B scales under different contamination settings, compare their machine translation performance against a clean baseline, and quantify the impact of each type of contamination.
Abstract: Data contamination—the accidental inclusion of evaluation examples in the pre-training data—can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at the 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B than for 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
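To make the contamination setup concrete, below is a minimal, hypothetical sketch (not the authors' actual pipeline): it injects parallel test pairs into a pre-training corpus under the three leak modes described above, and measures BLEU inflation against a clean baseline. The function and variable names are illustrative assumptions; BLEU is computed with the `sacrebleu` library.

```python
# Hypothetical sketch of the contamination setup described in the abstract.
# Assumes `sacrebleu` is installed; all names here are illustrative.
import random
import sacrebleu

def contaminate(pretrain_docs, test_pairs, n_copies=1, mode="source+target", seed=0):
    """Insert test examples into the pre-training corpus.

    mode: "source+target" leaks the parallel pair together;
    "source" or "target" leak only one side (the weaker settings).
    n_copies controls how often each leaked example is repeated.
    """
    rng = random.Random(seed)
    docs = list(pretrain_docs)
    for src, tgt in test_pairs:
        if mode == "source+target":
            leak = f"{src}\n{tgt}"
        elif mode == "source":
            leak = src
        else:
            leak = tgt
        for _ in range(n_copies):
            docs.insert(rng.randrange(len(docs) + 1), leak)
    return docs

def bleu_inflation(clean_hyps, contaminated_hyps, refs):
    """BLEU-point gap between a contaminated model and the clean baseline."""
    clean = sacrebleu.corpus_bleu(clean_hyps, [refs]).score
    contaminated = sacrebleu.corpus_bleu(contaminated_hyps, [refs]).score
    return contaminated - clean
```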
Lay Summary: Sometimes test examples end up in the training data, making models look better than they really are. This paper studies how that “cheating” affects medium (1B) and large (8B) models. Starting from a clean train and test dataset, we deliberately reintroduced test examples into the training set in different formats and with varying numbers of copies. This allowed us to compare the effect of seeing these test examples during pre-training with not seeing them. We did this specifically for the machine translation task, where we have a source sentence and want to translate it into a different language. We found that leaking source and translation together boosts BLEU scores (a measure of how good the translation is) by up to 30 points for the 8B model and around 8 BLEU points for the 1B model. Leaking only one side gave smaller, inconsistent gains. We also show that the frequency and recency of leaks affect the amount of score inflation, and that the effects vary with how much data a language has in our training set, which is called the resource level of a language.
Primary Area: General Machine Learning->Evaluation
Keywords: Data contamination, large language models (LLMs), machine translation, LLM evaluation, performance overestimation
Submission Number: 12266