Keywords: Synthetic Data, Privacy, Privacy in synthetic data, privacy in tabular data, privacy vs utility, privacy utility trade-off
TL;DR: We propose the PRISIM framework for generating synthetic tabular data of high fidelity and strong privacy.
Abstract: Data sharing in collaborative environments is instrumental in propelling innovation; however, sharing data carries a serious privacy risk, since sensitive information may leak. At the same time, shared data must retain its analytical utility to remain usable. This research therefore focuses on assessing and preserving both privacy and utility in centralized tabular data, one of the most common types of data used across industries (e.g., HR, CRM, healthcare). State-of-the-art (SOTA) centralized privacy-preservation techniques, such as statistical anonymization (using generalization, binning, suppression, etc.) and differential privacy (DP) methods, focus heavily on data privacy and largely ignore analytical utility. Hence, in this paper we propose a novel approach based on synthetic data generation with a statistical distance-based privacy-preserving mechanism (the framework is referred to as PRISIM) to produce analytically useful, private synthetic data. A new distance metric is also proposed by combining the Jaccard similarity index (JSI) and Mahalanobis distance (MD) to simulate a re-identification attack on mixed-type data. PRISIM is validated on five open-source data sets and compared against SOTA differentially private GANs; on average, we observe $>20\%$ higher retention of utility while maintaining a similar level of privacy.
Supplementary Material: zip