To the Cutoff... and Beyond? A Longitudinal Perspective on LLM Data Contamination

Manley Roberts; Himanshu Thakur; Christine Herlihy; Colin White; Samuel Dooley

To the Cutoff... and Beyond? A Longitudinal Perspective on LLM Data Contamination

Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley

Published: 16 Jan 2024, Last Modified: 15 Apr 2024ICLR 2024 posterEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: contamination, memorization, llm, codeforces, project euler, datasets, benchmarks, training cutoff

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We present the first thorough study of LLM data contamination on datasets released over time, showing that LLMs’ ability to solve coding problems changes dramatically as a function of metrics such as release date and GitHub popularity.

Abstract: Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmark in the age of LLMs which train on webscale data.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: datasets and benchmarks

Submission Number: 11

Loading