Keywords: machine learning, privacy
Abstract: In machine learning, data curation is used to select the most valuable data for
improving both model accuracy and computational efficiency. Recently, curation
has also been explored as a solution for private machine learning: rather than
training directly on sensitive data, which is known to leak information through
model predictions, the private data is used only to guide the selection of useful
public data. The resulting model is then trained solely on curated public data.
It is tempting to assume that such a model is privacy-preserving because it has
never seen the private data. Yet, we show that, without further protection, curation
pipelines can still leak private information. Specifically, we introduce novel attacks
against popular curation methods, targeting every major step: the computation of
curation scores, the selection of the curated subset, and the final trained model.
We demonstrate that each stage reveals information about the private dataset,
and that even models trained exclusively on curated public data leak membership
information about the private data that guided curation. These findings highlight
previously overlooked privacy risks inherent in data curation, and suggest
that (1) in the context of curation, privacy analysis must extend beyond the training
procedure to include the data selection process, and (2) truly privacy-preserving
curation will require new methods with formal privacy guarantees.
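For readers unfamiliar with the setting, below is a minimal, illustrative sketch of a generic curation-then-train pipeline of the kind the abstract describes. This is not the paper's method: the identity embedding, the cosine-similarity scoring rule, and the selection budget k are assumptions chosen purely for illustration.

# Minimal sketch (not the paper's method) of a curation-then-train pipeline:
# private data only guides which public examples are selected, and the model
# is trained solely on the curated public subset.
# Assumed/hypothetical: identity embedding, cosine-similarity curation score,
# and selection budget k.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: private (sensitive) features and a labeled public pool.
X_private = rng.normal(size=(50, 8))
X_public = rng.normal(size=(500, 8))
y_public = (X_public.sum(axis=1) > 0).astype(int)

def normalize(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

# Step 1: curation scores -- here, mean cosine similarity of each public
# example to the private set (the score computation is one of the attacked
# stages; this particular scoring rule is only illustrative).
scores = (normalize(X_public) @ normalize(X_private).T).mean(axis=1)

# Step 2: select the curated subset (top-k public examples by score).
k = 100
selected = np.argsort(scores)[-k:]

# Step 3: train only on the curated public data; the private set is never
# used for training, yet the scores, the selected subset, and this model
# can all leak information about the private data.
model = LogisticRegression(max_iter=1000).fit(X_public[selected], y_public[selected])
print("accuracy on public pool:", model.score(X_public, y_public))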
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2120