Mending synthetic data with MAPS: Model Agnostic Post-hoc Synthetic Data Refinement Framework

ICLR 2026 Conference Submission19681 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Generative modeling, Synthetic data, Post-hoc refinement, Privacy-Fidelity tradeoff
TL;DR: MAPS refines synthetic data via identifiability filtering and importance-weighted resampling, improving fidelity and utility while ensuring 0-identifiability guarantees.
Abstract: Generating high-quality synthetic data with privacy protections remains a challenging, ad hoc process that requires careful model design and training, often tailored to the characteristics of a target dataset. We present MAPS, a model-agnostic post-hoc framework that improves synthetic data quality for any pre-trained generative model while ensuring that sample-level privacy standards are met. Our two-stage approach first removes synthetic samples that violate privacy by being too close to real data, achieving 0-identifiability guarantees. It then applies importance weighting via a binary classifier to resample the remaining synthetic data according to estimated density ratios. We evaluate MAPS on two healthcare datasets (TCGA-metadata, GOSSIS-1-eICU-cardiovascular) and four generative models (TVAE, CTGAN, TabDiffusion, DGD), demonstrating significant improvements in fidelity and utility while maintaining privacy. For fidelity, 40 out of 48 statistical tests show significant improvements in marginal distributional measures, alongside gains in correlation structure preservation and joint distribution similarity. For example, the joint Jensen-Shannon distance decreases from 0.7888-0.8278 to 0.5434-0.5961 on TCGA-metadata and from 0.6192-0.7902 to 0.3633-0.4503 on GOSSIS-1-eICU-cardiovascular. Utility also improves: classification F1 scores rise from 0.0866-0.2400 to 0.3043-0.3848 on TCGA-metadata and from 0.1287-0.2085 to 0.2104-0.2497 on GOSSIS-1-eICU-cardiovascular across model-dataset combinations. In addition, uncertainty quantification via split conformal prediction shows that MAPS considerably improves calibration quality, reducing average prediction set sizes by 55-77% while maintaining target coverage on TCGA-metadata. The code for this project is available at https://anonymous.4open.science/r/MAPS-EBF8.
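The two stages described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the abstract does not specify the distance or the classifier, so the code assumes plain Euclidean distance, an identifiability rule in which a synthetic row is dropped if it lies closer to some real row than that row's nearest other real neighbour, and a small hand-rolled logistic regression as the binary classifier. All function names (`filter_identifiable`, `fit_logistic`, `importance_weights`, `maps_refine`) are hypothetical.

```python
import numpy as np


def filter_identifiable(real, synth):
    """Stage 1 (assumed rule): drop synthetic rows that sit closer to a real
    row than that real row's nearest other real neighbour."""
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=2)
    np.fill_diagonal(d_rr, np.inf)            # ignore self-distances
    radius = d_rr.min(axis=1)                 # per-real-row distance threshold
    d_sr = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    violates = (d_sr < radius[None, :]).any(axis=1)
    return synth[~violates]


def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression (stand-in classifier)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def importance_weights(real, synth):
    """Stage 2: estimate p_real(x) / p_synth(x) from a real-vs-synthetic
    classifier and normalise the ratios into resampling probabilities."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
    w = fit_logistic(X, y)
    logits = np.hstack([synth, np.ones((len(synth), 1))]) @ w
    logits = np.clip(logits, -20.0, 20.0)      # guard against overflow
    # p / (1 - p) equals exp(logit); rescale by the class-size ratio.
    ratio = np.exp(logits) * len(synth) / len(real)
    return ratio / ratio.sum()


def maps_refine(real, synth, n_out, rng):
    """Filter, then resample the surviving rows with replacement."""
    kept = filter_identifiable(real, synth)
    probs = importance_weights(real, kept)
    idx = rng.choice(len(kept), size=n_out, replace=True, p=probs)
    return kept[idx], kept, probs
```

Because the resampled rows are drawn only from the filtered set, the second stage cannot reintroduce a row that the first stage removed. In practice, any probabilistic classifier could replace the logistic regression here, and features would normally be scaled and encoded before distances are computed.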
Primary Area: generative models
Submission Number: 19681