Ensembled Bayesian tabular data generator

Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Gang Li, Wray Lindsay Buntine

Published: 2026, Last Modified: 18 Mar 2026. Knowl. Inf. Syst. 2026. License: CC BY-SA 4.0

Abstract: Tabular data generation has seen renewed interest with the advent of generative adversarial networks (GANs), a two-part framework consisting of a generator and a discriminator, both artificial neural networks, whose parameters are learned by optimizing a game-theoretic objective function. Recently, it has been shown that a Bayesian network can be used as either a generator or a discriminator in the GAN framework, resulting in an algorithm known as GANBLR. A Bayesian network encodes causal relations among attributes and is characterized by its structure and parameters. GANBLR has been shown to give state-of-the-art results for tabular data generation. However, the model has one limitation: it uses class attributes during model training. For example, a supervised Bayesian network is needed as the generator at training time. This makes GANBLR inapplicable when class information is not available. Addressing this shortcoming is the main motivation of this work. We propose a new model for tabular data generation, the masked ensemble tabular generator (MEG), which does not require class labels to generate tabular data. The proposed model relies on a novel strategy of using a collection of Bayesian networks as part of the generator, and on masking operations to train the generator efficiently. It also uses a group-based similarity measure to adjust the number of samples generated from each Bayesian network in the collection. We perform extensive experiments on a variety of datasets and demonstrate that MEG not only outperforms baselines that do not use class information during training, such as CTGAN and TVAE, but also outperforms baselines that have access to class information during training, such as TableGAN and CtabGAN. In terms of machine learning utility, its performance is close to that of GANBLR, with the added advantage of being truly unsupervised. We highlight this by demonstrating its applicability to a clustering task. We also investigate the privacy-preserving capabilities of MEG and demonstrate its superior performance compared to other baselines.
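
The abstract does not say how the group-based similarity measure is turned into per-network sample counts. As a purely illustrative sketch, and not the authors' method, the following assumes each Bayesian network in the collection already has a non-negative similarity score and splits a sample budget in proportion to those scores using largest-remainder rounding. The function name `allocate_samples` and its parameters are hypothetical.

```python
from typing import List, Sequence


def allocate_samples(similarities: Sequence[float], total: int) -> List[int]:
    """Split `total` samples across ensemble members in proportion to
    their similarity scores (hypothetical allocation rule, assumed here
    for illustration only)."""
    if total < 0:
        raise ValueError("total must be non-negative")
    if len(similarities) == 0 or any(s < 0 for s in similarities):
        raise ValueError("need at least one score, all non-negative")

    weight_sum = sum(similarities)
    if weight_sum == 0:
        # No member is preferred: fall back to a uniform split.
        shares = [total / len(similarities)] * len(similarities)
    else:
        shares = [total * s / weight_sum for s in similarities]

    # Take the integer part of each share first.
    counts = [int(share) for share in shares]

    # Hand the remaining samples to the largest fractional remainders.
    leftover = total - sum(counts)
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - counts[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Three hypothetical ensemble members with made-up scores.
    print(allocate_samples([3, 1, 1], 10))
```

Largest-remainder rounding is used here only so that the counts always sum exactly to the requested budget; any other proportional scheme would serve the same illustrative purpose.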

External IDs: dblp:journals/kais/ZhangZZLB26