Deduplicating Training Data Makes Language Models Better


17 Aug 2021 (modified: 05 May 2023) · ACL ARR 2021 August Blind Submission
Abstract: We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over $1\%$ of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets---for example removing from C4 a single 61 word English sentence that is repeated over $60{,}000$ times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over $4\%$ of the validation set of standard datasets, thus allowing for more accurate evaluation.
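The abstract describes removing near-duplicate examples from training data. As an illustrative sketch only (not the authors' implementation, which the paper describes in detail), one common approach to near-duplicate detection is MinHash: represent each document as a set of word n-grams, compress that set into a short signature, and drop documents whose signatures are too similar to one already kept. All function names and parameter values below are hypothetical choices for this sketch.

```python
import hashlib
import re


def shingles(text, n=5):
    """Represent a document as the set of its word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over all shingles in the set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard
    similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=0.8, n=5, num_hashes=64):
    """Keep each document unless it is a near-duplicate (estimated
    Jaccard >= threshold) of a document kept earlier."""
    kept, sigs = [], []
    for doc in docs:
        sh = shingles(doc, n)
        if not sh:  # too short to shingle: keep as-is
            kept.append(doc)
            sigs.append(None)
            continue
        sig = minhash_signature(sh, num_hashes)
        if any(s is not None and estimated_jaccard(sig, s) >= threshold
               for s in sigs):
            continue  # near-duplicate of an earlier document: drop it
        kept.append(doc)
        sigs.append(sig)
    return kept
```

A production pipeline would replace the pairwise signature comparison with locality-sensitive hashing so it scales to billions of documents, and would pair near-duplicate removal with exact-substring deduplication to catch the long repeated substrings the abstract mentions.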