TabSDS: a Lightweight, Fully Non-Parametric, and Model Free Approach for Generating Synthetic Tabular Data

Published: 01 May 2025, Last Modified: 18 Jun 2025
Venue: ICML 2025 poster
License: CC BY 4.0
TL;DR: TabSDS is a lightweight, non-parametric alternative to deep generative models for tabular data, offering very competitive performance, significantly faster execution, and open-source implementations in R and Python.
Abstract: The development of deep generative models for tabular data is currently a very active research area in machine learning. These models, however, tend to be computationally heavy and require careful tuning of multiple model parameters. In this paper, we propose TabSDS, a lightweight, non-parametric, and model-free alternative to tabular deep generative models that leverages rank and data-shuffling transformations to generate synthetic data that closely approximates the joint probability distribution of the real data. We evaluate TabSDS against multiple baselines implemented in the Synthcity Python library across several datasets. TabSDS showed very competitive performance against all baselines (including TabDDPM, a strong baseline model for tabular data generation). Importantly, TabSDS runs orders of magnitude faster than the deep generative baselines, and considerably faster than other computationally efficient baselines such as adversarial random forests.
Lay Summary: Creating synthetic versions of real-world data, such as health records or financial information, is an important task in machine learning. These artificial datasets help researchers build and test models with lower privacy risks. However, most current methods for generating this kind of data rely on complex algorithms that require substantial computing power and careful setup. Here we propose a new method for generating synthetic tabular data (data organized in rows and columns, like a spreadsheet). It uses simple techniques like sorting and shuffling to recreate the patterns found in real data. Our evaluations on several datasets showed that it performed competitively with the more complicated systems while running much faster. This makes it a practical and efficient tool for anyone who needs realistic synthetic data without the computational hassle.
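To give a feel for how sorting and ranks alone can carry dependence structure, here is a minimal sketch of a generic rank-matching idea. It is not the TabSDS algorithm, whose details are not given on this page (the authors' implementation is in the linked repository). The function name `rank_matched_synthetic` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np

def rank_matched_synthetic(real, rng):
    """Illustrative only: a generic rank-matching toy, NOT the TabSDS method."""
    n, d = real.shape
    synth = np.empty_like(real, dtype=float)
    for j in range(d):
        col = real[:, j]
        # Bootstrap a fresh sample from the column's empirical marginal, then sort it.
        draws = np.sort(rng.choice(col, size=n, replace=True))
        # Rank (0..n-1) of each real value within its column; ties broken by position.
        ranks = np.argsort(np.argsort(col, kind="stable"), kind="stable")
        # Place the sorted draws so they follow the real column's rank order.
        synth[:, j] = draws[ranks]
    return synth

# Toy data (an assumption for this sketch): two correlated numeric columns.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
real = np.column_stack([x, 0.8 * x + 0.6 * rng.normal(size=500)])
synth = rank_matched_synthetic(real, rng)
```

Because every synthetic column follows the rank order of the corresponding real column, the rank dependence between columns is carried over without fitting any model. Note that this toy reuses real values verbatim, so it says nothing about the privacy or fidelity properties of the method proposed in the paper.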
Link To Code: https://github.com/echaibub/TabSDS
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Tabular synthetic data, non-parametric, model-free, lightweight
Submission Number: 4298