TL;DR: Our method enables LLM dataset inference without any i.i.d. held-out set, by synthetically generating such a set and performing post-hoc calibration.
Abstract: The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigation. Our code is available at https://github.com/sprintml/PostHocDatasetInference.
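The two-step recipe in the abstract can be pictured with a short, hedged sketch. The code below is not the paper's implementation: `suspect_score`, `reference_score`, and `generate_completion` are hypothetical placeholders, the reference-model offset is only one possible way to calibrate the real-vs-synthetic likelihood gap, and the Mann-Whitney test stands in for whatever hypothesis test the paper uses.

```python
# Minimal, illustrative sketch: build a synthetic held-out set by completing
# prefixes of the suspect documents, calibrate away the real-vs-synthetic
# score gap, then run a one-sided statistical test for membership.
# The helper callables (suspect_score, reference_score, generate_completion)
# are hypothetical placeholders, not the paper's API.
from statistics import mean
from scipy import stats


def build_synthetic_heldout(texts, generate_completion, prefix_frac=0.5):
    """Complete each document's prefix to obtain in-distribution held-out text."""
    heldout = []
    for text in texts:
        cut = int(len(text) * prefix_frac)
        heldout.append(text[:cut] + generate_completion(text[:cut]))
    return heldout


def dataset_inference(suspect_texts, suspect_score, reference_score,
                      generate_completion, alpha=0.05):
    """Return (dataset_was_trained_on, p_value) for the suspect LLM.

    suspect_score(text):   membership signal (e.g. negative log-likelihood)
                           under the suspect LLM.
    reference_score(text): the same signal under a model known NOT to have
                           been trained on the suspect data.
    """
    heldout_texts = build_synthetic_heldout(suspect_texts, generate_completion)

    # Post-hoc calibration (one possible variant): under the reference model,
    # any real-vs-synthetic score gap is a generation artifact rather than a
    # membership signal, so subtract that gap from the held-out scores.
    artifact_gap = (mean(reference_score(t) for t in heldout_texts)
                    - mean(reference_score(t) for t in suspect_texts))

    suspect = [suspect_score(t) for t in suspect_texts]
    heldout = [suspect_score(t) - artifact_gap for t in heldout_texts]

    # One-sided test: suspect texts should score significantly more
    # "member-like" (lower NLL) than the calibrated held-out texts only if
    # they were actually used in training.
    _, p_value = stats.mannwhitneyu(suspect, heldout, alternative="less")
    return p_value < alpha, p_value
```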
Lay Summary: **Problem:** Large language models (LLMs) are trained on massive amounts of text scraped from the internet, often without permission from authors and data curators. Existing dataset inference methods can detect whether someone's writing was used to train these models, but they require similar unpublished text for comparison, which most data curators have not set aside for legal purposes.
**Solution:** We developed a way to synthetically generate the comparison text. Our method works in two steps: first, we train a small AI system on the published data to create synthetic text that matches its style and topics. Then, we apply a calibration technique that separates differences caused by our text generation from genuine signals that the original data was used in training.
**Impact:** Our approach successfully identified when specific datasets were used to train LLMs across diverse text types while avoiding false accusations against model providers. This gives writers, journalists, and other content creators a practical tool to prove their work was used without permission, potentially supporting copyright lawsuits and helping establish data ownership rights in the age of AI.
Link To Code: https://github.com/sprintml/PostHocDatasetInference
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, dataset inference, synthetic data
Submission Number: 2592