Keywords: Synthetic data, Post-hoc refinement, Privacy-Fidelity tradeoff, Generative modeling
TL;DR: MAPS refines synthetic data via privacy filtering and importance weighting, improving fidelity and utility while ensuring 0-identifiability guarantees.
Abstract: Generating high-quality synthetic data with privacy protections remains a challenging ad-hoc process, requiring careful model design and training often tailored to the characteristics of a targeted dataset. We present MAPS, a model-agnostic post-hoc framework that improves synthetic data quality for any pre-trained generative model while ensuring sample-level privacy standards are met. Our two-stage approach first removes synthetic samples that violate privacy by being too close to real data, achieving 0-identifiability guarantees. Second, we employ importance weighting via a binary classifier to resample the remaining synthetic data according to estimated density ratios. We evaluate MAPS across two healthcare datasets (TCGA-metadata, GOSSIS-1-eICU-cardiovascular) and four generative models (TVAE, CTGAN, TabDDPM, DGD), demonstrating significant improvements in fidelity and utility while maintaining privacy. Notably, MAPS achieves substantial improvements in fidelity metrics, with 40 out of 48 statistical tests demonstrating significant improvements in marginal distributional measures and notable enhancements in correlation structure preservation and joint distribution similarity. For example, Joint Jensen-Shannon Distance reduced from ranges of 0.7888-0.8278 to 0.5434-0.5961 on TCGA-metadata and 0.6192-0.7902 to 0.3633-0.4503 on GOSSIS-1-eICU-cardiovascular. Utility improvements are equally impressive, with classification F1 scores improving from ranges of 0.0866-0.2400 to 0.3043-0.3848 on TCGA-metadata and 0.1287-0.2085 to 0.2104-0.2497 on GOSSIS-1-eICU-cardiovascular across different model-dataset combinations. The code of this project is available at https://github.com/yanlihub/MAPS.git.
Submission Number: 81
Loading