Programmable Synthetic Data Generation

23 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: synthetic data, tabular data, generative modelling
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper presents ProgSyn, the first programmable synthetic tabular data generation method.
Abstract: Large amounts of tabular data remain underutilized due to privacy, data quality, and data sharing limitations. While generating synthetic data resembling the original distribution addresses some of these issues, most applications would benefit from additional customization of the generated data. However, existing synthetic data generation approaches are limited to particular constraints, e.g., differential privacy (DP) or fairness. In this work, we introduce ProgSyn, the first programmable and flexible synthetic tabular data generation framework. Customization is achieved via programmatically declared statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). To ensure high synthetic data quality in the presence of custom specifications, ProgSyn pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss automatically derived from the provided specifications using novel relaxations. We conduct an extensive experimental evaluation of ProgSyn on four datasets and numerous custom specifications, where we outperform state-of-the-art specialized approaches on several tasks while being more general. For instance, at the same fairness level we achieve 2.3% higher downstream accuracy than the state-of-the-art in fair synthetic data generation on the Adult dataset.
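To illustrate the idea of turning a declared statistical specification into a loss term, the following is a minimal sketch and not ProgSyn's actual relaxation or API. It assumes the generator emits per-row probabilities for a binary protected attribute and a binary label, and it replaces hard counts with expected counts so that a demographic parity gap becomes a smooth function of those probabilities (differentiable almost everywhere). The function names `demographic_parity_gap` and `total_loss`, and the `weight` parameter, are hypothetical. NumPy is used only to show the arithmetic; an actual fine-tuning loop would need an automatic differentiation framework.

```python
import numpy as np


def demographic_parity_gap(p_group, p_label):
    """Relaxed |P(y=1 | g=1) - P(y=1 | g=0)| computed from soft row values.

    p_group[i]: probability that synthetic row i is in the protected group.
    p_label[i]: probability that synthetic row i has the positive label.
    With hard 0/1 inputs this reduces to the usual count-based statistic.
    """
    p_group = np.asarray(p_group, dtype=float)
    p_label = np.asarray(p_label, dtype=float)
    eps = 1e-12  # guards against division by zero when a group is empty

    # Expected counts stand in for hard counts (illustrative assumption:
    # group and label are treated as independent within each row).
    rate_in = np.sum(p_group * p_label) / (np.sum(p_group) + eps)
    rate_out = np.sum((1.0 - p_group) * p_label) / (np.sum(1.0 - p_group) + eps)
    return abs(rate_in - rate_out)


def total_loss(data_fit_loss, p_group, p_label, weight=1.0):
    """Hypothetical fine-tuning objective: data fit plus weighted penalty."""
    return data_fit_loss + weight * demographic_parity_gap(p_group, p_label)


# Hard 0/1 rows: positive rate is 1/2 inside the group and 2/2 outside it.
hard_gap = demographic_parity_gap([1, 1, 0, 0], [1, 0, 1, 1])

# Fully uncertain rows: both relaxed rates coincide, so the gap vanishes.
soft_gap = demographic_parity_gap([0.5] * 4, [0.5] * 4)
```

Under these assumptions, the penalty is zero exactly when the two relaxed positive rates match, and the `weight` parameter trades specification satisfaction against fidelity to the original data.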
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7367