Unlocking Post-hoc Dataset Inference with Synthetic Data

Published: 10 Jun 2025, Last Modified: 13 Jul 2025, DIG-BUG Oral, CC BY 4.0
Keywords: large language models, dataset inference, synthetic data
TL;DR: Our method enables LLM dataset inference without any i.i.d. held-out set by generating this set and performing post-hoc calibration.
Abstract: The remarkable capabilities of large language models stem from massive internet-scraped training datasets, often obtained without respecting data owners' intellectual property rights. Dataset Inference (DI) enables data owners to verify unauthorized data use by identifying whether a suspect dataset was used for training. However, current DI methods require a private held-out set whose distribution closely matches that of the compromised dataset. Such held-out data are rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set through two key contributions: (1) creating high-quality, diverse synthetic data via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigation.
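Since this page gives only a high-level description of the method, the following is a minimal, hypothetical sketch of the pipeline the abstract outlines: synthesize a held-out set by completing suffixes of real prefixes, score texts under the suspect model, calibrate away synthetic-vs-real likelihood gaps, and run a statistical DI test. All names here (complete_suffix, calibrated_score, dataset_inference), the use of gpt2 as a stand-in generator, and the reference-model calibration are illustrative assumptions, not the authors' implementation; the paper's actual post-hoc calibration may differ.

```python
# Illustrative sketch only; not the authors' code.
import torch
from scipy import stats
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in tokenizer
gen = AutoModelForCausalLM.from_pretrained("gpt2").to(device)  # stand-in generator


def complete_suffix(prefix: str, max_new_tokens: int = 64) -> str:
    """Step 1: synthesize a held-out document by completing a real prefix,
    mimicking the suffix-based completion task described in the abstract."""
    ids = tok(prefix, return_tensors="pt").input_ids.to(device)
    out = gen.generate(ids, max_new_tokens=max_new_tokens, do_sample=True,
                       top_p=0.95, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)


@torch.no_grad()
def avg_loglik(model, text: str) -> float:
    """Average per-token log-likelihood of `text` under `model`."""
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    return -model(ids, labels=ids).loss.item()  # loss is mean NLL per token


def calibrated_score(suspect_model, ref_model, text: str) -> float:
    """Step 2 (stand-in): reference-model calibration, a common technique in
    the membership-inference literature. Subtracting an independent model's
    likelihood cancels gaps caused by synthetic-text artifacts that both
    models see, approximating the paper's post-hoc calibration."""
    return avg_loglik(suspect_model, text) - avg_loglik(ref_model, text)


def dataset_inference(member_scores, heldout_scores, alpha: float = 0.01) -> bool:
    """Step 3: flag unauthorized training-set use if the suspect set scores
    significantly higher than the generated held-out set."""
    _, p = stats.ttest_ind(member_scores, heldout_scores, alternative="greater")
    return p < alpha
```

In this sketch the held-out set never needs to exist beforehand: it is produced on demand from prefixes of the owner's own documents, which is what makes the approach post-hoc.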
Submission Number: 50