Disjoint Generation of Synthetic Data

TMLR Paper7254 Authors

30 Jan 2026 (modified: 18 Feb 2026). Under review for TMLR. License: CC BY 4.0
Abstract: We propose a new framework for generating synthetic tabular datasets with disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets, each of which is supplied to a separate generative model instance. The resulting synthetic subsets are then combined post hoc by a joining operation that does not require common variables or identifiers. We demonstrate the framework through several case studies and examples on tabular data, which also illuminate some of the available design choices. The advantages of disjoint generation include: (i) an observed increase in empirically measured privacy; (ii) increased computational feasibility of certain model types; and (iii) the ability to generate synthetic data using a mixture of different generative models. Specifically, mixed-model synthesis bridges the gap between privacy and utility, providing state-of-the-art accuracy and area under the curve (AUC) on downstream tasks while significantly lowering the empirical re-identification risk.
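The partition, generate, and join pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the disjoint subsets are groups of columns, uses a bootstrap resampler as a stand-in for each generative model, and joins the synthetic parts by row position. The abstract does not specify the joining operation, so the positional join is only a placeholder, and all function names here are hypothetical.

```python
import random


def partition_columns(columns, n_parts):
    """Split column names into n_parts disjoint groups (round-robin)."""
    return [columns[i::n_parts] for i in range(n_parts)]


def fit_bootstrap_generator(rows, cols, rng):
    """Stand-in 'generative model' (assumption): resamples observed value
    tuples for the given columns. A real model would replace this."""
    observed = [tuple(row[c] for c in cols) for row in rows]

    def sample(n):
        return [dict(zip(cols, rng.choice(observed))) for _ in range(n)]

    return sample


def disjoint_generate(rows, n_parts, n_samples, seed=0):
    """Fit one generator per column group, then join the synthetic parts."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    parts = partition_columns(columns, n_parts)

    # Each generator only ever sees its own subset of columns.
    synthetic_parts = [
        fit_bootstrap_generator(rows, cols, rng)(n_samples) for cols in parts
    ]

    # Placeholder join (assumption): pair synthetic rows by position,
    # since the parts share no common variable or identifier.
    joined = []
    for pieces in zip(*synthetic_parts):
        record = {}
        for piece in pieces:
            record.update(piece)
        joined.append(record)
    return joined


real_rows = [
    {"age": 31, "income": 52000, "city": "Oslo", "label": 1},
    {"age": 45, "income": 61000, "city": "Bergen", "label": 0},
    {"age": 27, "income": 39000, "city": "Oslo", "label": 0},
]
synthetic_rows = disjoint_generate(real_rows, n_parts=2, n_samples=5)
```

Because each part is sampled independently, this naive positional join discards any dependence between column groups; a joining operation that preserves such dependence would be needed to retain downstream utility.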
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Stefan_Feuerriegel1
Submission Number: 7254