Privacy Amplification Through Synthetic Data: Insights from Linear Regression

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 poster | CC BY-SA 4.0
TL;DR: Releasing synthetic data and keeping its generation model hidden could lead to better privacy guarantees, as demonstrated in our study on linear regression.
Abstract: Synthetic data inherits the differential privacy guarantees of the model used to generate it. Additionally, synthetic data may benefit from privacy amplification when the generative model is kept hidden. While empirical studies suggest this phenomenon, a rigorous theoretical understanding is still lacking. In this paper, we investigate this question through the well-understood framework of linear regression. First, we establish negative results showing that if an adversary controls the seed of the generative model, a single synthetic data point can leak as much information as releasing the model itself. Conversely, we show that when synthetic data is generated from random inputs, releasing a limited number of synthetic data points amplifies privacy beyond the model's inherent guarantees. We believe our findings in linear regression can serve as a foundation for deriving more general bounds in the future.
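To make the setup concrete, below is a minimal Python sketch of the pipeline the abstract describes. It is an illustration under assumptions the abstract does not fix: output perturbation stands in for the differentially private training mechanism, the synthetic inputs are standard Gaussian, and names such as `theta_dp`, `X_syn`, and `noise_scale` are hypothetical, with `noise_scale` a placeholder rather than a calibrated privacy parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sensitive dataset: n points in d dimensions.
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

# A DP linear-regression model via output perturbation: perturb the
# least-squares solution with Gaussian noise. In a real deployment,
# noise_scale must be calibrated to the estimator's sensitivity and
# the target (epsilon, delta); here it is only a placeholder.
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
noise_scale = 0.5  # placeholder, NOT a calibrated DP parameter
theta_dp = theta_hat + noise_scale * rng.normal(size=d)

# The setting studied in the paper: keep theta_dp hidden and release only
# m synthetic points whose inputs are *random*, not adversary-chosen seeds.
m = 10
X_syn = rng.normal(size=(m, d))   # fresh random inputs
y_syn = X_syn @ theta_dp          # responses from the hidden model
release = (X_syn, y_syn)          # all the analyst ever observes
```

In this sketch, the paper's negative result corresponds to letting an adversary pick `X_syn`, in which case even a single released pair can be as revealing as `theta_dp` itself; the amplification result applies when `X_syn` is random and the number of released points is limited.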
Lay Summary: When data holders generate synthetic data, they aim to protect individual privacy. Current methods typically involve training a model to replicate sensitive data while preserving privacy, and then releasing this model so users can generate an arbitrary amount of synthetic data. Empirical evidence suggests that keeping the model parameters hidden might offer better privacy protection than revealing them, but this has not been theoretically proven. To explore this, we studied the privacy guarantees of synthetic data generation in a simplified setting: linear regression. Our analysis reveals that randomizing the inputs of the synthetic data generation process amplifies privacy protection, but if an adversary controls the inputs, this amplification disappears. We believe that our findings can serve as a foundation for the analysis of more complex and widely used generative models, potentially strengthening the privacy protection of synthetic data in practical scenarios.
Primary Area: Social Aspects->Privacy
Keywords: Differential Privacy, Synthetic Data
Submission Number: 11272