Data Contamination: From Memorization to Exploitation

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: It is common nowadays to train NLP models on massive web-based datasets. Previous work has shown that these datasets sometimes contain downstream test sets, a phenomenon typically referred to as "data contamination". It is not clear, however, to what extent models exploit the contaminated data for downstream tasks. In this paper we present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Our experiments with two models and three downstream tasks indicate that exploitation exists in some cases; in others, the models memorize the contaminated data but do not exploit it. We show that these two measures are affected by different factors, such as the number of contaminated-data occurrences, model size, and random seeds. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is obtained by better language understanding and not better data exploitation.
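The seen-versus-unseen comparison described in the abstract can be made concrete as a simple per-example evaluation: split the fine-tuned model's test predictions by whether each labeled example also appeared in the pretraining corpus, and report the accuracy gap between the two groups. The sketch below is illustrative only; the data layout, function names, and the specific gap metric are assumptions for demonstration, not the paper's formal definitions of memorization or exploitation.

```python
# Illustrative sketch (not the paper's code): measure an "exploitation" gap as the
# difference in fine-tuned task accuracy between test examples that were seen during
# pretraining (contaminated) and those that were not. Field names and the seen/unseen
# flag are assumptions for this example.

from dataclasses import dataclass
from typing import Sequence


@dataclass
class EvalExample:
    prediction: int  # label predicted by the fine-tuned model
    gold: int        # gold label
    seen: bool       # True if the labeled example appeared in the pretraining corpus


def accuracy(examples: Sequence[EvalExample]) -> float:
    if not examples:
        return float("nan")
    return sum(e.prediction == e.gold for e in examples) / len(examples)


def exploitation_gap(examples: Sequence[EvalExample]) -> float:
    """Accuracy on seen (contaminated) examples minus accuracy on unseen ones.

    A clearly positive gap suggests the model exploits the contaminated data;
    a gap near zero is consistent with memorization without exploitation.
    """
    seen = [e for e in examples if e.seen]
    unseen = [e for e in examples if not e.seen]
    return accuracy(seen) - accuracy(unseen)


if __name__ == "__main__":
    # Toy usage: three seen and three unseen test examples.
    data = [
        EvalExample(prediction=1, gold=1, seen=True),
        EvalExample(prediction=0, gold=0, seen=True),
        EvalExample(prediction=1, gold=1, seen=True),
        EvalExample(prediction=1, gold=0, seen=False),
        EvalExample(prediction=0, gold=0, seen=False),
        EvalExample(prediction=1, gold=1, seen=False),
    ]
    print(f"exploitation gap: {exploitation_gap(data):+.3f}")
```

In practice such a gap would be averaged over multiple random seeds, since the abstract notes that both memorization and exploitation are sensitive to seed choice.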