HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection

Published: 26 Sept 2024, Last Modified: 13 Nov 2024NeurIPS 2024 Track Datasets and Benchmarks PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: tabular synthetic data, LLMs, privacy, kNN, instruction fine-tuning
TL;DR: A novel framework for the generation of tabular data by LLMs and its evaluation.
Abstract: Data serves as the fundamental basis for advancing deep learning. The tabular data presented in a structured format is highly valuable for modeling and training. However, even in the era of LLM, obtaining tabular data from sensitive domains remains a challenge due to privacy or copyright concerns. Therefore, exploring the methods for effectively using models like LLMs to generate synthetic tabular data, which is privacy-preserving but similar to original one, is urgent. In this paper, we introduce a new framework HARMONIC for tabular data generation and evaluation by LLMs. In the data generation part of our framework, we employ fine-tuning to generate tabular data and enhance privacy rather than continued pre-training which is often used by previous small-scale LLM-based methods. In particular, we construct an instruction fine-tuning dataset based on the idea of the k-nearest neighbors algorithm to inspire LLMs to discover inter-row relationships. By such fine-tuning, LLMs are trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage. The experiments find that our tabular data generation achieves equivalent performance as existing methods but with better privacy by the metric of MLE, DCR, etc. In the evaluation part of our framework, we develop a specific privacy risk metric DLT for LLM synthetic data generation, which quantifies the extent to which the generator itself leaks data. We also developed LLE, a performance evaluation metric for downstream LLM tasks, which is more practical and credible than previous metrics. The experiments show that our data generation method outperform the previous methods in the metrics DLT and LLE.
Submission Number: 1026
Loading