Keywords: Tabular data generation
Abstract: Tabular data generation methods aim to synthesize artificial samples by learning the distribution of training data.
However, most existing tabular data generation methods are purely data-driven.
They perform poorly when training samples are insufficient or when there is a distribution shift between the training data and the true data.
In many real-world scenarios, data owners can provide additional knowledge beyond the raw data, such as domain-specific descriptions or dependencies among features.
Motivated by this, we categorize the types of knowledge that can effectively support tabular data generation, and incorporate selected knowledge as auxiliary information to guide the generation process.
To this end, we propose KTGen, a $\textbf{K}$nowledge-enhanced $\textbf{T}$abular data $\textbf{Gen}$eration framework.
KTGen leverages auxiliary information by training a correction network in the latent space produced by a VAE, aligning the generated data with the auxiliary information.
Our experiments demonstrate that, when training on limited, biased data, incorporating auxiliary information makes the distribution of synthetic samples closer to the true data distribution, and also improves the performance of downstream models trained on the synthetic samples.
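The idea of correcting samples in a latent space so that decoded data agrees with auxiliary information can be illustrated with a toy sketch. This is not KTGen's architecture: the linear decoder stands in for a trained VAE decoder, the "correction network" is reduced to a single learnable shift vector, and the auxiliary information is assumed to be a known vector of feature means. The names `fit_latent_correction`, `decoder_w`, `decoder_b`, and `target_mean` are hypothetical.

```python
import numpy as np

def fit_latent_correction(z, decoder_w, decoder_b, target_mean, lr=0.1, steps=500):
    """Learn a latent shift so decoded feature means match `target_mean`.

    Hypothetical stand-in for a correction network: one learnable shift
    vector `delta`, fitted by plain gradient descent.
    """
    delta = np.zeros(z.shape[1])
    for _ in range(steps):
        decoded = (z + delta) @ decoder_w + decoder_b  # decode corrected latents
        gap = decoded.mean(axis=0) - target_mean       # mismatch with auxiliary info
        grad = decoder_w @ gap                         # gradient of 0.5 * ||gap||^2 w.r.t. delta
        delta -= lr * grad
    return delta

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))                 # latents from limited, biased training data
decoder_w = np.array([[1.0, 0.0, 0.5],
                      [0.0, 1.0, 0.5]])        # fixed linear "decoder" weights
decoder_b = np.zeros(3)

# Auxiliary knowledge: feature means of the true population, constructed here
# so that they correspond to a latent shift of (0.5, -0.3).
true_shift = np.array([0.5, -0.3])
target_mean = (z.mean(axis=0) + true_shift) @ decoder_w + decoder_b

delta = fit_latent_correction(z, decoder_w, decoder_b, target_mean)
gap_before = np.linalg.norm((z @ decoder_w + decoder_b).mean(axis=0) - target_mean)
gap_after = np.linalg.norm(((z + delta) @ decoder_w + decoder_b).mean(axis=0) - target_mean)
```

In this sketch the decoder stays frozen and only the latent correction is fitted, so the auxiliary information moves the generated distribution without retraining the generative model on the biased data.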
Primary Area: generative models
Submission Number: 24941