Synthetic Data Generation Using Combinatorial Testing and Variational Autoencoder

Published: 01 Jan 2023, Last Modified: 06 Feb 2025, Venue: ICSTW 2023, License: CC BY-SA 4.0
Abstract: Data is a crucial component in machine learning. However, many datasets contain sensitive information, such as personally identifiable health and financial data, and access to them must be restricted to avoid potential security concerns. Synthetic data generation addresses this problem by generating artificial data that is similar to, and thus can be used in place of, the original real-world data. This research introduces a synthetic data generation approach called CT-VAE that combines Combinatorial Testing (CT) and a Variational Autoencoder (VAE). We first use the VAE to learn the distribution of the real-world data and encode it in a lower-dimensional latent space. Next, we use CT to sample the latent space by generating a t-way set of latent vectors, each of which represents a data point in the latent space. A synthetic dataset is then generated by decoding each latent vector in the t-way set. Our experimental evaluation suggests that machine learning models trained on synthetic datasets generated by our approach can achieve performance very similar to that of models trained on the real-world datasets. Furthermore, our approach performs better than several state-of-the-art synthetic data generation approaches.
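The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each latent dimension is discretized into a few representative values (here -1.0, 0.0, 1.0, an arbitrary choice), builds a t-way covering set with a simple greedy heuristic, and passes each latent vector through a placeholder `decode` function that stands in for a trained VAE decoder. The names `t_way_cover`, `decode`, and `DECODER_WEIGHTS` are hypothetical and do not come from the paper.

```python
from itertools import combinations, product
import random


def t_way_cover(levels, t=2, candidates=50, seed=0):
    """Greedy t-way covering set: for every choice of t dimensions, every
    combination of their values appears in at least one returned vector."""
    rng = random.Random(seed)
    k = len(levels)
    dim_sets = list(combinations(range(k), t))
    # Every (dimensions, values) interaction that still needs to be covered.
    uncovered = {
        (dims, vals)
        for dims in dim_sets
        for vals in product(*(levels[d] for d in dims))
    }
    rows = []
    while uncovered:
        # Pin one uncovered interaction so each new row makes progress.
        dims, vals = next(iter(uncovered))
        best, best_gain = None, -1
        for _ in range(candidates):
            row = [rng.choice(levels[d]) for d in range(k)]
            for d, v in zip(dims, vals):
                row[d] = v
            # Count how many uncovered interactions this candidate would cover.
            gain = sum(
                (ds, tuple(row[d] for d in ds)) in uncovered for ds in dim_sets
            )
            if gain > best_gain:
                best, best_gain = row, gain
        rows.append(tuple(best))
        for ds in dim_sets:
            uncovered.discard((ds, tuple(best[d] for d in ds)))
    return rows


# Placeholder decoder (assumption): a fixed random linear map from a
# 4-dimensional latent vector to 5 features. A real pipeline would call
# the decoder of a VAE trained on the original data instead.
_rng = random.Random(1)
DECODER_WEIGHTS = [[_rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]


def decode(z):
    return [sum(w * x for w, x in zip(row, z)) for row in DECODER_WEIGHTS]


# Assumed discretization: three representative values per latent dimension.
levels = [[-1.0, 0.0, 1.0]] * 4
latent_vectors = t_way_cover(levels, t=2)
synthetic_rows = [decode(z) for z in latent_vectors]
print(len(latent_vectors), "latent vectors instead of", 3 ** 4, "exhaustive ones")
```

The point of the t-way set is that it covers every combination of values for any t latent dimensions with far fewer vectors than the full Cartesian product. A production setup would more likely use a dedicated covering-array generator than this greedy heuristic.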