Keywords: synthetic data, differential privacy, foundation models
Abstract: Privacy in training data is crucial to protect sensitive personal information, prevent data misuse, and ensure compliance with legal regulations, all while maintaining trust and safeguarding individuals' rights in the development of ML models.
Unfortunately, state-of-the-art methods that train ML models on image datasets under differential privacy constraints typically suffer reduced accuracy due to the added noise. Alternatively, using synthetic data avoids the direct use of private data and thus preserves privacy, but suffers from domain discrepancies with respect to the test data. This paper proposes a new methodology that combines both approaches by generating differentially private synthetic data closely aligned with the target domain, thereby improving the utility-privacy trade-off.
Our approach begins by creating a synthetic base dataset using a class-conditional generative model. To address the domain gap between the synthetic dataset and the private dataset, we introduce \textbf{Privacy-Aware Synthetic Dataset Alignment (PASDA)}, which leverages the feature statistics of the private dataset to guide the domain alignment process. PASDA produces a synthetic dataset that guarantees privacy while remaining highly useful for downstream training tasks.
Building on this, we achieve state-of-the-art performance, surpassing the most competitive baseline by over 13\% on CIFAR-10.
Furthermore, our $(1,10^{-5})$-DP synthetic data achieves model performance on par with or surpassing models trained on the original STL-10, ImageNette, and CelebA datasets. With zero-shot generation, our method does not require resource-intensive retraining, offering a synthetic data generation solution that introduces \textbf{privacy} to a machine learning pipeline with both high \textbf{efficiency} and \textbf{efficacy}.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8956