Keywords: differential privacy, synthetic data, DP-SGD, privacy-utility tradeoff, tabular data, healthcare, variational autoencoder, scaling laws
TL;DR: Differentially private synthetic data has a sharp viability threshold, which we map across six datasets, showing that strong privacy costs far less data than existing theory predicts.
Abstract: Differentially private synthetic data can enable data sharing without compromising individual privacy, but DP-SGD adds noise that can destroy utility when training data is scarce.
How much data is enough is poorly understood.
We characterise a sharp \emph{viability boundary}, a training set size below which DP models produce random-chance output and above which they approach non-private baselines.
Across six tabular datasets spanning healthcare, census and ecology domains, we find that the ratio $N/d$ (training samples per encoded dimension) consistently predicts this transition, with viability emerging between $N/d \approx 50$ and $300$.
The boundary is insensitive to model size.
The data cost of strong privacy is sublinear in $1/\varepsilon$: $\varepsilon = 1$ requires only ${\sim}2.5\times$ more data than $\varepsilon = 10$, well below formal DP-ERM predictions.
A controlled dimension-reduction experiment confirms that $N/d$, not $N$ alone, drives viability.
These results give practitioners an actionable heuristic: check $N/d$ before investing in DP synthetic data generation, and prefer feature engineering over data collection when the ratio is too low.
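
The $N/d$ check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names, the one-hot encoding rule used to count $d$, and the verdict labels are all assumptions. Only the approximate band of $50$ to $300$ comes from the abstract, and where a given dataset falls within that band is not modelled here, nor is the dependence on $\varepsilon$.

```python
# Approximate viability band reported in the abstract (N/d between ~50 and ~300).
LOWER_RATIO = 50
UPPER_RATIO = 300


def encoded_dimension(n_numeric, categorical_cardinalities):
    """Count encoded dimensions d.

    Assumption (hypothetical encoding): one dimension per numeric column
    and one dimension per category level (one-hot) for categorical columns.
    """
    return n_numeric + sum(categorical_cardinalities)


def viability_check(n_samples, d_encoded):
    """Return (N/d, verdict) relative to the reported band."""
    if d_encoded <= 0:
        raise ValueError("d_encoded must be positive")
    ratio = n_samples / d_encoded
    if ratio < LOWER_RATIO:
        verdict = "below band"    # DP model likely near chance
    elif ratio <= UPPER_RATIO:
        verdict = "inside band"   # viability depends on dataset and epsilon
    else:
        verdict = "above band"    # likely viable
    return ratio, verdict


if __name__ == "__main__":
    # Hypothetical table: 5,000 rows, 8 numeric columns, 4 categorical columns.
    d_raw = encoded_dimension(8, [2, 40, 50, 100])   # d = 200
    print(viability_check(5000, d_raw))

    # Same rows after merging rare category levels (feature engineering).
    d_reduced = encoded_dimension(8, [2, 10, 10, 20])  # d = 50
    print(viability_check(5000, d_reduced))
```

In this hypothetical example, reducing $d$ from 200 to 50 moves the same 5,000 rows from below the band to inside it, which mirrors the abstract's advice to prefer feature engineering over data collection when the ratio is too low.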
Submission Number: 94