Keywords: multilayer perceptron, piecewise linear activation functions, nonparametric density estimation, unary classification, synthetic tabular data
Abstract: Synthetic tabular data for downstream machine learning are usually generated with autoencoders of various kinds, on the assumption that their ability to "reproduce" input data from points of a low-dimensional latent space automatically implies that they also "reproduce" the statistical and structural properties of the original sample's distribution. No evidence is offered for this assumption. This article proposes a consistent data generation method based on the authors' approach to the unary (one-class) classification problem, which uses a fully connected neural network (multilayer perceptron) with piecewise linear activation functions. It is shown that the output of such a network is an adaptive histogram estimate of a probability density defined on a compact set. Consistency conditions for nonparametric estimates of this type were obtained in [Devroye 1980, 1996]. Tabular data are synthesized by thinning random vectors uniformly distributed on the compact set in accordance with the estimated density. The method is illustrated on model examples.
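The thinning step described in the abstract can be sketched as plain rejection sampling. The sketch below is an illustration under stated assumptions, not the authors' implementation: it replaces the network-derived adaptive histogram with an ordinary fixed-grid histogram (`np.histogramdd`), takes the bounding box of the sample as the compact set, and uses hypothetical helper names (`fit_histogram`, `cell_index`, `synthesize`).

```python
import numpy as np


def fit_histogram(data, bins=8):
    """Fixed-grid histogram density on the bounding box of `data`.

    Assumption: a stand-in for the adaptive histogram estimate that the
    abstract attributes to the piecewise linear network.
    """
    lo, hi = data.min(axis=0), data.max(axis=0)
    density, edges = np.histogramdd(
        data, bins=bins, range=list(zip(lo, hi)), density=True
    )
    return density, edges


def cell_index(points, edges):
    """Map each point to the index of the histogram cell containing it."""
    idx = []
    for j, e in enumerate(edges):
        i = np.searchsorted(e, points[:, j], side="right") - 1
        idx.append(np.clip(i, 0, len(e) - 2))  # right edge goes to last cell
    return tuple(idx)


def synthesize(data, n_samples, bins=8, seed=0):
    """Thin uniform candidates so that survivors follow the histogram density."""
    rng = np.random.default_rng(seed)
    density, edges = fit_histogram(data, bins)
    lo = np.array([e[0] for e in edges])
    hi = np.array([e[-1] for e in edges])
    peak = density.max()

    kept, total = [], 0
    while total < n_samples:
        # Candidates uniformly distributed on the compact set (a box here).
        cand = rng.uniform(lo, hi, size=(4 * n_samples, data.shape[1]))
        # Keep a candidate with probability density(candidate) / peak.
        u = rng.uniform(0.0, peak, size=len(cand))
        keep = u < density[cell_index(cand, edges)]
        kept.append(cand[keep])
        total += int(keep.sum())
    return np.concatenate(kept)[:n_samples]
```

For example, `synthesize(data, 500)` on a two-column NumPy array returns 500 synthetic rows. Candidates that land in empty histogram cells are always rejected, so the synthetic sample never leaves the estimated support.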
Submission Number: 13