SYNTHIA: A Multi-Agent GAN-LLM Fusion for Statistically Guided Synthetic Data Generation

ICLR 2026 Conference Submission 16146 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Synthetic Data Generation, Multi-Agent Systems, Large Language Models, Generative Adversarial Networks
TL;DR: SYNTHIA is a GAN-inspired, multi-agent LLM framework for tabular data generation that leverages statistical feedback to iteratively refine prompts, achieving state-of-the-art fidelity and diversity over strong baselines.
Abstract: Access to high-quality, large-scale datasets is critical for training effective AI models, yet high costs, privacy concerns, and regulatory barriers often constrain data collection. Existing synthetic data generation methods, particularly for tabular data, struggle to preserve statistical integrity and utility, limiting their applicability in sensitive domains. To address this, we propose SYNTHetic Intelligence Architecture (SYNTHIA), a novel framework that integrates large language models (LLMs) as both the generator and discriminator within a GAN-inspired architecture for high-fidelity tabular data generation. Guided by metadata encodings, the LLM-based generator ensures that synthetic data reflects the statistical and structural properties of real datasets. A core innovation is the statistically enhanced discriminator, which incorporates a novel evaluation algorithm to rigorously quantify fidelity, diversity, and alignment with real data. This mechanism minimizes distributional divergence and accelerates convergence, ensuring realistic and utility-preserving synthetic data. Extensive experiments across diverse tabular datasets demonstrate that SYNTHIA consistently outperforms state-of-the-art methods, highlighting its scalability and adaptability for applications in data-constrained environments such as healthcare, finance, and security.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 16146
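
The abstract describes a discriminator that scores synthetic tables statistically and feeds the result back into the generator's prompt, but it does not specify the evaluation algorithm. The following is only a minimal sketch of that general idea under stated assumptions: the per-column metrics (total variation distance for categorical columns, a standardized mean shift for numeric ones), the threshold, and every function name here are hypothetical illustrations, not SYNTHIA's actual method.

```python
from collections import Counter
from statistics import mean, pstdev


def total_variation(real, synth):
    """Total variation distance between two categorical samples (0 = identical)."""
    p, q = Counter(real), Counter(synth)
    n_p, n_q = len(real), len(synth)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


def mean_shift(real, synth):
    """Absolute difference in means, scaled by the real column's standard deviation."""
    scale = pstdev(real) or 1.0  # avoid dividing by zero for constant columns
    return abs(mean(real) - mean(synth)) / scale


def statistical_feedback(real_rows, synth_rows, threshold=0.1):
    """Return {column: score} for columns whose divergence exceeds the threshold.

    Rows are dicts sharing the same keys. The metric choice and threshold are
    assumptions made for illustration only.
    """
    feedback = {}
    for col in real_rows[0]:
        real = [r[col] for r in real_rows]
        synth = [s[col] for s in synth_rows]
        numeric = all(isinstance(v, (int, float)) for v in real + synth)
        score = mean_shift(real, synth) if numeric else total_variation(real, synth)
        if score > threshold:
            feedback[col] = round(score, 3)
    return feedback


def refine_prompt(prompt, feedback):
    """Append one corrective hint per divergent column to the generator prompt."""
    if not feedback:
        return prompt
    hints = [
        f"- Column '{col}' diverges from the real data (score {score})."
        for col, score in sorted(feedback.items())
    ]
    return prompt + "\nAdjust the next batch:\n" + "\n".join(hints)
```

In a loop of this shape, the generator LLM would be re-prompted with the output of `refine_prompt` until `statistical_feedback` returns an empty dict or an iteration budget runs out; the LLM calls themselves are omitted because the abstract gives no detail on them.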