Differentially-private data synthetisation for efficient re-identification risk control

Tânia Carvalho; Nuno Moniz; Luis Antunes; Nitesh V. Chawla

Published: 01 Jan 2025, Last Modified: 20 Jul 2025 | Mach. Learn. 2025 | CC BY-SA 4.0

Abstract: User data privacy can be protected by many methods, from statistical transformations to generative models, but each has critical drawbacks. Creating a transformed data set with traditional techniques is highly time-consuming. Recent deep learning-based solutions require significant computational resources and long training phases, and differential privacy-based solutions may undermine data utility. In this paper, we propose \(\epsilon\)-PrivateSMOTE, a technique designed to protect against re-identification and linkage attacks, particularly in cases with a high re-identification risk. Our proposal combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases. We demonstrate that \(\epsilon\)-PrivateSMOTE achieves competitive results in privacy risk and better predictive performance than multiple traditional and state-of-the-art privacy-preservation methods, including generative adversarial networks, variational autoencoders, and differential privacy baselines. We also show that our method reduces time requirements by at least a factor of 9 and is a resource-efficient solution that ensures high performance without specialised hardware.
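
The abstract's idea of noise-induced interpolation applied to high-risk cases can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the parameters `k` and `per_case`, the use of Laplace noise with scale 1/epsilon as the interpolation factor, and the way high-risk rows are supplied (as a list of indices) are all assumptions made for illustration, and the sketch makes no claim of a formal differential privacy guarantee.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def nearest_neighbours(point, data, k):
    """Return the k rows of `data` closest to `point` (Euclidean), excluding `point` itself."""
    others = [row for row in data if row is not point]
    return sorted(others, key=lambda row: math.dist(point, row))[:k]


def synthesise_high_risk(data, high_risk_idx, epsilon, k=2, per_case=1, seed=0):
    """Replace each high-risk row with `per_case` noisy interpolations.

    Hypothetical sketch: SMOTE-style interpolation towards a random nearest
    neighbour, with the usual uniform factor swapped for Laplace noise of
    scale 1/epsilon (smaller epsilon means more noise). Rows that are not
    flagged as high-risk are copied unchanged.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    high_risk = set(high_risk_idx)
    # Keep low-risk rows as they are.
    output = [list(row) for i, row in enumerate(data) if i not in high_risk]
    for i in sorted(high_risk):
        point = data[i]
        neighbours = nearest_neighbours(point, data, k)
        for _ in range(per_case):
            neighbour = rng.choice(neighbours)
            # One independent noise draw per attribute (an assumption).
            synthetic = [
                x + laplace_noise(1.0 / epsilon, rng) * (n - x)
                for x, n in zip(point, neighbour)
            ]
            output.append(synthetic)
    return output


if __name__ == "__main__":
    toy = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.5]]
    for row in synthesise_high_risk(toy, high_risk_idx=[2], epsilon=1.0, per_case=2):
        print(row)
```

In this sketch only numeric attributes are handled, and how cases are flagged as high-risk is left to the caller; both choices are simplifications rather than details taken from the abstract.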
