A Consistent Method for Generating Synthetic Tabular Data with a Fully Connected Neural Network

Published: 09 Mar 2025, Last Modified: 10 Mar 2025, MathAI 2025 Oral, CC BY 4.0
Keywords: multilayer perceptron, piecewise linear activation functions, nonparametric density estimation, unary classification, synthetic tabular data
Abstract: Synthetic tabular data for subsequent use in machine learning are usually generated with various kinds of autoencoders, on the assumption that their ability to "reproduce" input data from points of a low-dimensional latent space automatically implies "reproducing" the statistical and structural properties of the distribution of the original sample. No evidence is provided that this assumption holds. This article proposes a consistent data generation method based on the authors' approach to solving the unary classification problem with a fully connected neural network (multilayer perceptron) with piecewise linear activation functions. It is shown that the output of such a network is an adaptive histogram estimate of a probability density defined on a compact set. Consistency conditions for nonparametric estimates of this type were obtained in [Devroye 1980, 1996]. Tabular data are synthesized by thinning random vectors uniformly distributed on the compact set in accordance with the estimated density. The method is illustrated with model examples.
Submission Number: 13
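
The thinning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a plain fixed-grid histogram stands in for the adaptive histogram estimate produced by the network, the compact set is assumed to be the unit square, and the names histogram_density and synthesize, as well as the toy "original" sample, are hypothetical. Candidates are drawn uniformly on the compact set, and each is kept with probability proportional to the estimated density at that point.

```python
import random

def histogram_density(sample, lo, hi, bins):
    """Fixed-grid histogram density estimate on the box [lo, hi]^d.

    Assumption: this simple estimator replaces the network's adaptive
    histogram, which is not specified in the abstract.
    """
    d = len(sample[0])
    width = (hi - lo) / bins
    counts = {}
    for x in sample:
        # Cell index per coordinate, clamped so x == hi falls in the last bin.
        cell = tuple(min(int((xi - lo) / width), bins - 1) for xi in x)
        counts[cell] = counts.get(cell, 0) + 1
    cell_volume = width ** d
    n = len(sample)
    density = {cell: c / (n * cell_volume) for cell, c in counts.items()}
    return density, width

def synthesize(density, width, lo, hi, bins, d, size, rng):
    """Thin uniform candidates: keep x with probability f(x) / max f."""
    f_max = max(density.values())
    out = []
    while len(out) < size:
        x = [rng.uniform(lo, hi) for _ in range(d)]
        cell = tuple(min(int((xi - lo) / width), bins - 1) for xi in x)
        # Empty cells have density 0, so candidates there are never kept.
        if rng.random() * f_max < density.get(cell, 0.0):
            out.append(x)
    return out

rng = random.Random(0)

# Hypothetical "original" sample: 2-D points concentrated near (0.3, 0.7),
# restricted to the unit square.
original = []
while len(original) < 2000:
    p = [rng.gauss(0.3, 0.08), rng.gauss(0.7, 0.08)]
    if all(0.0 <= v <= 1.0 for v in p):
        original.append(p)

density, width = histogram_density(original, 0.0, 1.0, bins=10)
synthetic = synthesize(density, width, 0.0, 1.0, 10, 2, 500, rng)
```

In this sketch the accepted points follow the histogram density exactly, so the quality of the synthetic sample depends entirely on the density estimate. That is the step for which the abstract cites consistency conditions.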