Abstract: In this work, we carry out a data archaeology
to infer books that are known to ChatGPT and
GPT-4 using a name cloze membership inference query. We find that OpenAI models have
memorized a wide collection of copyrighted
materials, and that the degree of memorization
is tied to the frequency with which passages
of those books appear on the web. The ability
of these models to memorize an unknown set
of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform
much better on memorized books than on non-memorized books for downstream tasks. We
argue that this supports a case for open models
whose training data is known.
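The name cloze membership inference query mentioned above can be illustrated with a minimal sketch. The exact prompt template and scoring protocol are specific to the paper; the code below is only an approximation of the idea, and `query_model` is a hypothetical stand-in for a call to the ChatGPT/GPT-4 API:

```python
def make_name_cloze_prompt(passage: str, name: str) -> str:
    """Mask a single character name in a passage and ask the model to fill it in.

    The wording here is an illustrative approximation, not the paper's
    exact prompt template.
    """
    masked = passage.replace(name, "[MASK]")
    return (
        "The following passage contains a masked proper name. "
        "Reply with only the name that fills in [MASK].\n\n" + masked
    )


def name_cloze_hit(prediction: str, name: str) -> bool:
    """An exact match on the masked name counts as a memorization signal."""
    return prediction.strip() == name


# Usage: build a prompt for a passage, send it to the model (hypothetical
# `query_model`), and check whether the model reproduces the masked name.
prompt = make_name_cloze_prompt("Call me Ishmael.", "Ishmael")
# prediction = query_model(prompt)  # stand-in for an API call
```

The intuition: a model that has merely seen similar prose cannot reliably recover a specific proper name, so exact-match recovery across many passages of a book is evidence that the book itself appeared in training data.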