Student Lead Author Indication: Yes
Keywords: synthetic data generation, out-of-distribution data generation, latent diffusion, variational auto-encoder
TL;DR: We present a new method to generate OOD tabular data using a latent diffusion model with class-separated latent representations, creating boundary samples via latent manipulations to enhance robustness against distribution shifts and unseen examples.
Abstract: Many critical machine learning applications in cybersecurity, healthcare, and finance face challenges such as data privacy, distribution shifts, and class imbalance. Minority-class labels are often scarce and may be present only for specific types of samples, which makes it difficult to develop models that handle new and unforeseen minority examples at inference time. Additionally, feeding sensitive data into downstream models is a significant privacy concern. Synthetic data generation offers a potential solution: it supports data privacy, creates samples to rebalance the classes, and provides a way to generate out-of-distribution samples. We introduce TabOOD, a novel approach that generates synthetic tabular data samples to enhance robustness against unseen data and distribution shifts. TabOOD generates out-of-distribution samples that can augment the training set, simulating unobserved scenarios and enhancing downstream model robustness. It also allows for the conditional generation of in-distribution minority and majority class samples. Building on recent advances in tabular data synthesis using latent diffusion models, our approach maps tabular data to class-dependent Gaussian mixture components in a latent space, thereby separating the latent representations of the classes, before training diffusion models on that latent space. We further manipulate the latent space to generate atypical, boundary data points. Experimental results across different datasets demonstrate that TabOOD significantly improves the performance of downstream models when faced with distribution shifts or novel out-of-distribution samples, offering a more balanced and robust approach to tabular data learning.
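Illustrative sketch (not the TabOOD implementation): the abstract does not specify the encoder, the diffusion model, or the exact latent manipulation, so the toy example below only illustrates the general idea of class-dependent Gaussian components in a latent space and of boundary points lying between them. Everything in it is an assumption made for illustration: the 2-D latent space, the hand-picked class means, the isotropic standard deviation, the hypothetical helper names `sample_class_latents` and `boundary_latents`, and the use of linear interpolation as the manipulation.

```python
import numpy as np

def sample_class_latents(means, std, labels, rng):
    """Draw one latent per label from an isotropic Gaussian centred on that class's mean.

    Assumption: one Gaussian component per class with a shared std.
    """
    means = np.asarray(means, dtype=float)
    labels = np.asarray(labels)
    noise = rng.standard_normal((labels.size, means.shape[1]))
    return means[labels] + std * noise

def boundary_latents(z_a, z_b, alpha=0.5):
    """Linearly interpolate between paired latents of two classes.

    Assumption: interpolation stands in for the unspecified latent manipulation.
    alpha=0 returns z_a, alpha=1 returns z_b.
    """
    return (1.0 - alpha) * z_a + alpha * z_b

rng = np.random.default_rng(0)
class_means = [[-3.0, 0.0], [3.0, 0.0]]  # hypothetical majority / minority means

z_major = sample_class_latents(class_means, 0.5, np.zeros(200, dtype=int), rng)
z_minor = sample_class_latents(class_means, 0.5, np.ones(200, dtype=int), rng)

# Points halfway between the two components: atypical for either class.
z_boundary = boundary_latents(z_major, z_minor, alpha=0.5)
```

In a full pipeline, latents such as `z_boundary` would be passed through a trained decoder to obtain tabular rows; that step is omitted here because it depends on model details the abstract does not give.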
Submission Number: 5