Keywords: Data Augmentation, Data Privacy, Generative Adversarial Networks (GANs), Hybrid AI Models, Imbalanced Data, Large Language Models (LLMs), Machine Learning, Synthetic Data Generation, Tabular Data
Abstract: The growing need for privacy-preserving synthetic tabular data has led to the development
of generative models, particularly generative adversarial networks (GANs) such as CTGAN (Conditional
Tabular GAN) and Enhanced CTGAN. While these models have demonstrated success in tabular data synthesis,
they suffer from mode collapse, weak rare-category representation, and limited domain adaptability,
often requiring manual tuning for different datasets. Furthermore, GAN-based approaches lack contextual
awareness, making them ineffective at preserving logical feature relationships and real-world constraints.
This paper introduces SYN-TITAN (Synthetic Tabular Intelligence using Transformers and Adversarial
Networks), a hybrid LLM-GAN framework that integrates large language models (LLMs) with adversarial
learning to enhance data fidelity, privacy compliance, and scalability. LLMs assist in feature engineering,
data augmentation, and evaluation, ensuring that synthetic data maintains semantic integrity. SYN-TITAN
is benchmarked against CTGAN, Enhanced CTGAN, and other state-of-the-art synthetic data generators
using public datasets, demonstrating superior statistical alignment, rare-category preservation, and domain
adaptation. Our findings indicate that LLM-guided GAN training can significantly improve synthetic tabular
data quality, addressing key challenges in privacy-sensitive domains such as healthcare and finance. This
work provides a scalable and interpretable hybrid approach to synthetic data generation, paving the way for
more context-aware, adaptable, and reliable synthetic data frameworks.
Primary Area: generative models
Submission Number: 24298