Keywords: Synthetic Tabular Data Generation, Large Language Models, Fine-Tuning
Abstract: Synthetic data generation is integral to ML pipelines, e.g., to augment training data, replace sensitive information, and even to power advanced platforms like DeepSeek. While LLMs fine-tuned for synthetic data generation are gaining traction, synthetic table generation remains under-explored compared to text and image synthesis, even though tables are a critical data type in business and science. This paper shows that LLMs, whether used as-is or after traditional fine-tuning, are inadequate for generating synthetic tables. Their autoregressive nature, combined with random column-order permutation during fine-tuning, hampers the modeling of functional dependencies and prevents them from capturing the conditional mixtures of distributions essential for satisfying real-world constraints. We demonstrate that making LLMs permutation-aware can mitigate these issues.
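To make the failure mode concrete, below is a minimal Python sketch (with hypothetical names; not the paper's implementation) of the textual row serialization used in GReaT-style fine-tuning, where the column order is randomly permuted per training example, contrasted with a fixed order that respects a known functional dependency such as city determining country.

    import random

    def serialize_row(row: dict, rng: random.Random, permute: bool = True) -> str:
        # Encode one table row as text for autoregressive fine-tuning.
        # permute=True mimics the random column-order permutation used in
        # GReaT-style training; permute=False keeps a fixed column order.
        items = list(row.items())
        if permute:
            rng.shuffle(items)
        return ", ".join(f"{col} is {val}" for col, val in items)

    rng = random.Random(0)
    row = {"city": "Paris", "country": "France", "population": 2_100_000}

    # Random order: 'country' can precede 'city', so the functional
    # dependency city -> country is sometimes conditioned backwards.
    print(serialize_row(row, rng, permute=True))

    # Fixed order (city before country) keeps the dependency aligned
    # with left-to-right autoregressive generation.
    print(serialize_row(row, rng, permute=False))

Under random permutation, the model must learn every conditional ordering of the columns, including ones that invert the dependency; a permutation-aware scheme can instead favor orderings consistent with the data's constraints.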
Submission Number: 14