Dated Data: Tracing Knowledge Cutoffs in Large Language Models

Jeffrey Cheng; Marc Marone; Orion Weller; Dawn Lawrie; Daniel Khashabi; Benjamin Van Durme

Published: 10 Jul 2024, Last Modified: 26 Aug 2024. Venue: COLM. License: CC BY 4.0

Research Area: Data, Evaluation

Keywords: knowledge cutoffs, training data, temporal alignment

TL;DR: A single knowledge cutoff date does not capture the entirety of an LLM's training corpus, so we design a simple probing method using time-spanning datasets and analyze a large set of open-access pretraining corpora.

Abstract: Large Language Models (LLMs) are often paired with a reported cutoff date, the time at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, a reported cutoff only scratches the surface. Do all sub-resources in the training data share the same cutoff? Does the model's demonstrated knowledge of these sub-resources align closely with their cutoff? We define the notion of an effective cutoff, which is distinct from the LLM's reported cutoff and differs between sub-resources. We propose a simple approach to estimate the effective cutoffs of an LLM at the resource level by probing across versions of the data. Crucially, our method does not require access to a model's pre-training data. Through our analysis, we find that effective cutoffs often differ drastically from reported cutoffs. To understand the root cause of this observation, we conduct a large-scale analysis of open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal misalignments of CommonCrawl data due to non-trivial amounts of old data in new dumps; and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators and by practitioners who seek to use these models.
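
As a rough illustration of probing across versions of a resource, the sketch below picks the version that a model fits best. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each dated version of a resource has already been scored with some per-document measure where lower means a better fit (for example, perplexity), and it takes the version with the lowest mean score as the estimated effective cutoff. The function name `estimate_effective_cutoff` and the toy numbers are hypothetical and are not results from the paper.

```python
from statistics import mean


def estimate_effective_cutoff(scores_by_version):
    """Return (version, mean_score) for the version the model fits best.

    scores_by_version maps a version label (e.g. "2021-07") to a list of
    per-document scores where lower means a better fit. The choice of score
    (perplexity here) is an assumption of this sketch.
    """
    # Average the per-document scores of each non-empty version.
    averaged = {
        version: mean(scores)
        for version, scores in scores_by_version.items()
        if scores
    }
    if not averaged:
        raise ValueError("need at least one version with scores")
    # The version with the lowest mean score is the estimate.
    best = min(averaged, key=averaged.get)
    return best, averaged[best]


# Toy, made-up scores for three dated versions of one resource.
toy_scores = {
    "2021-01": [12.0, 11.0],
    "2021-07": [9.0, 10.0],
    "2022-01": [11.5, 12.5],
}
best_version, best_score = estimate_effective_cutoff(toy_scores)
print(best_version, best_score)
```

Because the function only consumes scores, it needs no access to the pre-training data itself; any scoring procedure that yields comparable numbers across versions could be plugged in.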

Submission Number: 289
