Tabby: A Language Model Architecture for Tabular and Structured Data Synthesis

TMLR Paper 5882 Authors

13 Sept 2025 (modified: 16 Sept 2025) · Under review for TMLR · CC BY 4.0
Abstract: Large language models (LLMs) have greatly improved the quality of synthetic text data. We aim to extend these advances to tabular data with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby represents differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. Pairing Tabby with Plain, our novel tabular training technique, we observe up to a $7\%$ improvement in quality (measured by MLE) over previous methods. Additionally, our approach is more flexible than prior strategies and extends beyond tables, to more general structured data. In a structured JSON setting, Tabby outperforms all other methods by $2$-$3$ points and is the only approach with MLE equal to the upper bound of non-synthetic data.
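The abstract describes Tabby as a Gated Mixture-of-Experts modification with column-specific parameter sets. The sketch below illustrates that general idea only: each token's hidden state is routed through an expert MLP chosen by the table column the token belongs to. All class names, dimensions, and the hard column-index routing are placeholder assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ColumnGatedMoE(nn.Module):
    """Illustrative gated mixture-of-experts block with one expert MLP per
    table column. Hypothetical sketch; names and sizes are assumptions."""

    def __init__(self, d_model: int, d_hidden: int, num_columns: int):
        super().__init__()
        # One small feed-forward expert per column (column-specific parameters).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_columns)
        )

    def forward(self, hidden: torch.Tensor, column_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); column_ids: (batch, seq), values in [0, num_columns).
        out = torch.zeros_like(hidden)
        for c, expert in enumerate(self.experts):
            mask = column_ids == c            # tokens belonging to column c
            if mask.any():
                out[mask] = expert(hidden[mask])
        return out
```

In this sketch the "gate" is simply the column index of each token, so parameters are never shared across columns; how Tabby actually gates and which Transformer sub-blocks it replaces are details given in the paper, not here.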
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jeff_Phillips1
Submission Number: 5882