Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees
TL;DR: We propose a novel differentially private method for (1) linear regression with uncertainty quantification and (2) synthetic data generation that enables broader privacy-preserving analysis.
Abstract: In the social sciences, where small- to medium-scale datasets are common, canonical tasks such as linear regression are ubiquitous. In privacy-aware settings, substantial work has been done on differentially private (DP) linear regression. However, most existing methods focus primarily on point estimation, with limited consideration of uncertainty quantification. At the same time, synthetic data generation (SDG) is gaining importance as a tool for enabling replication studies in privacy-aware settings. Yet current DP linear regression approaches do not readily support SDG. Moreover, mainstream SDG approaches are either designed for discretized data, which makes them less suitable for continuous regression tasks, or rely on deep learning models that require large datasets to train effectively. These limitations reduce their applicability to the smaller, continuous data regimes that are typical in social science research. To address these challenges, we propose a novel method for linear regression with valid inference under Gaussian DP. We derive a DP bias-corrected regression estimator and its asymptotic confidence interval. We also introduce a more general procedure for generating synthetic data, ensuring that running linear regression on the synthetic data is equivalent to the proposed DP linear regression. Our approach is designed to operate effectively in lower-dimensional settings. Experiments demonstrate that our method: (1) improves statistical accuracy on most datasets compared to existing DP methods for linear regression; (2) provides valid confidence intervals; and (3) yields more accurate synthetic data, as measured by downstream ML tasks, compared to existing DP SDG approaches.
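To give a concrete sense of the general recipe behind DP linear regression, the sketch below perturbs the sufficient statistics X^T X and X^T y with Gaussian noise after clipping each record's contribution. This is only an illustrative sketch of the broad approach under assumed parameters (`sigma`, `B` are hypothetical), not the paper's bias-corrected estimator or its confidence-interval construction.

```python
import numpy as np

def dp_linear_regression(X, y, sigma=1.0, B=1.0, rng=None):
    """Illustrative DP linear regression via sufficient-statistics
    perturbation (Gaussian mechanism). `sigma` is the noise scale and
    `B` the per-record clipping bound; both are assumed parameters."""
    rng = np.random.default_rng(rng)
    # Clip each row of X and each entry of y so that one record's
    # contribution to the sufficient statistics is bounded (sensitivity control).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X * np.minimum(1.0, B / np.maximum(norms, 1e-12))
    y = np.clip(y, -B, B)
    d = X.shape[1]
    # Add symmetric Gaussian noise to X^T X and independent noise to X^T y.
    noise = rng.normal(0.0, sigma, size=(d, d))
    XtX = X.T @ X + (noise + noise.T) / np.sqrt(2)
    Xty = X.T @ y + rng.normal(0.0, sigma, size=d)
    # Solve the noisy normal equations; a tiny ridge keeps the system stable.
    return np.linalg.solve(XtX + 1e-6 * np.eye(d), Xty)

# Usage: with sigma=0 the sketch reduces to (clipped) ordinary least squares.
X = np.column_stack([np.ones(200), np.linspace(-1, 1, 200)])
y = 0.5 + 2.0 * X[:, 1]
beta = dp_linear_regression(X, y, sigma=0.0, B=10.0)  # approx [0.5, 2.0]
```

Note that this plain sufficient-statistics estimator is biased by the clipping and the noise in X^T X, which is precisely the kind of effect the paper's bias-corrected estimator and asymptotic confidence interval are designed to account for.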
Submission Number: 1820