Abstract: Natural Language Processing (NLP) research is becoming increasingly focused on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic review of work using OpenAI's ChatGPT and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document how much data has been leaked to ChatGPT in the first year after the model's release. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, including unfair or missing baseline comparisons, reproducibility issues, and authors' lack of awareness of the data usage policy. Our work provides the first quantification of the ChatGPT data leakage problem.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Data analysis, Surveys
Languages Studied: English