Keywords: Tabular data, Generative models, Generative adversarial networks, Variational autoencoders, Classifier two-sample test, Numerical feature encoding
TL;DR: We show that strong classifiers such as XGBoost can distinguish synthetic data from fresh real data, and we propose a series of encoders that improve the performance of neural-network-based generative models.
Abstract: If 'realistic' means indistinguishable from (fresh) real data, generating realistic synthetic tabular data is far from trivial. We present a series of experiments showing that strong classifiers such as XGBoost can distinguish state-of-the-art synthetic data from fresh real data almost perfectly on several tabular datasets. By studying the most important features of these classifiers, we observe that mixed-type (continuous/discrete) and ill-distributed numerical columns are the least faithfully reproduced. We therefore propose and evaluate a series of automated, reversible, column-wise encoders that improve the realism of the data these generators produce.
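
To make the idea of a reversible column-wise encoder for a mixed-type column concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the encoders proposed in the paper: the function names (`fit_mixed_encoder`, `encode`, `decode`) and the `min_share` threshold are hypothetical. The sketch treats any exact value that holds a large share of the rows as a discrete "spike", encodes each entry as a (category, continuous value) pair, and decodes back to the original column without loss.

```python
from collections import Counter


def fit_mixed_encoder(column, min_share=0.05):
    """Find 'spike' values: exact values holding at least min_share of the rows.

    min_share is a hypothetical threshold chosen for illustration.
    """
    counts = Counter(column)
    n = len(column)
    spikes = sorted(v for v, c in counts.items() if c / n >= min_share)
    return {"spikes": spikes}


def encode(column, params):
    """Map each value to a (category, continuous part) pair.

    Category 0 marks the continuous regime; category k >= 1 marks the
    k-th spike value, whose continuous part is set to a 0.0 placeholder.
    """
    index = {v: k + 1 for k, v in enumerate(params["spikes"])}
    encoded = []
    for x in column:
        if x in index:
            encoded.append((index[x], 0.0))
        else:
            encoded.append((0, x))
    return encoded


def decode(encoded, params):
    """Invert encode: restore spike values, pass continuous values through."""
    spikes = params["spikes"]
    return [spikes[cat - 1] if cat > 0 else value for cat, value in encoded]


# Toy mixed-type column: mostly exact zeros, plus a few continuous values.
column = [0.0, 0.0, 0.0, 12.5, 0.0, 3.2, 99.0, 0.0, 7.1, 0.0]
params = fit_mixed_encoder(column, min_share=0.3)
encoded = encode(column, params)
restored = decode(encoded, params)
```

In such a scheme a generator would model the category as a discrete feature and the continuous part as a numerical one, and `decode` would be applied to its output. The continuous part could additionally be passed through an invertible transform to reduce skew, which this sketch leaves out.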