How much of my dataset did you use? Quantitative Data Usage Inference in Machine Learning

Yao Tong; Jiayuan Ye; Sajjad Zarifzadeh; Reza Shokri

Published: 06 Mar 2025, Last Modified: 30 Apr 2025. Venue: ICLR 2025 Workshop Data Problems (Poster). License: CC BY 4.0

Keywords: Machine Learning, Privacy, Dataset Usage Inference, Dataset Ownership, Membership Inference Attack, Dataset Copyright

TL;DR: The first method to give a quantitative, non-binary answer to the question "How much of a dataset was used to train a given model?"

Abstract: How much of a given dataset was used to train a machine learning model? This is a critical question for data owners assessing the risk of unauthorized data usage and protecting their rights (United States Code, 1976). However, previous work mistakenly treats this as a binary problem, inferring whether *all or none* or *any or none* of the data was used, which is fragile when faced with real, non-binary data usage risks. To address this, we propose a fine-grained analysis called Dataset Usage Cardinality Inference (DUCI), which estimates the exact proportion of data used. Our algorithm, leveraging debiased membership guesses, matches the performance of the optimal MLE approach (with a maximum error $<0.1$) but at significantly lower computational cost (e.g., $300\times$ less).
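To illustrate what "debiased membership guesses" could look like in practice, the following is a minimal sketch, not the paper's implementation. It assumes each record in the dataset receives a binary membership guess from some membership inference attack, and that the attack's true positive rate (`tpr`) and false positive rate (`fpr`) are known or have been estimated separately. Under those assumptions, the raw fraction of positive guesses is a biased estimate of the usage proportion, and a per-record correction `(guess - fpr) / (tpr - fpr)` removes that bias in expectation. The function names, the simulation helper, and the error rates are all hypothetical.

```python
import random


def debiased_usage_estimate(guesses, tpr, fpr):
    """Estimate the fraction of records used in training.

    guesses: iterable of 0/1 membership guesses, one per record.
    tpr, fpr: assumed (hypothetical) error rates of the membership guesser.
    """
    guesses = list(guesses)
    if not guesses:
        raise ValueError("need at least one membership guess")
    if tpr <= fpr:
        raise ValueError("the guesser must satisfy tpr > fpr")
    # E[guess] = p * tpr + (1 - p) * fpr, so this correction is unbiased for p.
    debiased = [(g - fpr) / (tpr - fpr) for g in guesses]
    estimate = sum(debiased) / len(debiased)
    # A proportion must lie in [0, 1]; clip sampling noise outside that range.
    return min(1.0, max(0.0, estimate))


def simulate_guesses(n, true_proportion, tpr, fpr, seed=0):
    """Simulate a noisy guesser on n records (illustration only)."""
    rng = random.Random(seed)
    n_used = round(n * true_proportion)
    guesses = []
    for i in range(n):
        # The first n_used records play the role of training members.
        hit_rate = tpr if i < n_used else fpr
        guesses.append(1 if rng.random() < hit_rate else 0)
    return guesses


guesses = simulate_guesses(n=20000, true_proportion=0.8, tpr=0.6, fpr=0.2)
naive = sum(guesses) / len(guesses)  # raw fraction of positive guesses
corrected = debiased_usage_estimate(guesses, tpr=0.6, fpr=0.2)
print(f"naive: {naive:.3f}, debiased: {corrected:.3f}")
```

In this simulation the naive fraction of positive guesses concentrates around $0.8 \cdot 0.6 + 0.2 \cdot 0.2 = 0.52$, well below the true proportion of $0.8$, while the corrected average concentrates around $0.8$. How the error rates are obtained for a real model is outside the scope of this sketch.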

Submission Number: 40
