Abstract: Large language models (LLMs) have greatly improved the quality of synthetic text data. We aim to extend these advances to tabular data with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby represents differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. Pairing Tabby with Plain, our novel tabular training technique, we observe up to a $7\%$ improvement in quality (measured by MLE) over previous methods. Additionally, our approach is more flexible than prior strategies and extends beyond tables, to more general structured data. In a structured JSON setting, Tabby outperforms all other methods by $2$-$3$ points and is the only approach with MLE equal to the upper bound of non-synthetic data.
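As a rough illustration of the column-specific Gated Mixture-of-Experts idea described in the abstract, the sketch below shows one way a Transformer feed-forward block could be replaced by per-column experts, with each token routed (hard-gated) to the expert for the column it belongs to. All names, shapes, and the routing scheme here (PerColumnMoE, column_ids, the expert widths) are illustrative assumptions for exposition only, not the authors' implementation; the linked repository contains the actual code.

```python
# Hypothetical sketch: a Mixture-of-Experts feed-forward layer with one
# expert (column-specific parameter set) per table column.
import torch
import torch.nn as nn


class PerColumnMoE(nn.Module):
    """Routes each token to the feed-forward expert of its table column."""

    def __init__(self, hidden_dim: int, num_columns: int):
        super().__init__()
        # One small feed-forward expert per column (column-specific parameters).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_columns)
        )

    def forward(self, hidden: torch.Tensor, column_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim); column_ids: (batch, seq) giving the
        # column index of each token, used as a hard gate over the experts.
        out = torch.zeros_like(hidden)
        for col, expert in enumerate(self.experts):
            mask = column_ids == col
            if mask.any():
                out[mask] = expert(hidden[mask])
        return out


if __name__ == "__main__":
    layer = PerColumnMoE(hidden_dim=64, num_columns=3)
    x = torch.randn(2, 10, 64)
    cols = torch.randint(0, 3, (2, 10))
    print(layer(x, cols).shape)  # torch.Size([2, 10, 64])
```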
Certifications: J2C Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: De-anonymized camera-ready version. Rebuttal changes are no longer highlighted in blue; minor formatting changes and typo corrections.
Video: https://youtu.be/hprGB2IeAW4
Code: https://github.com/soCromp/tabby
Supplementary Material: zip
Assigned Action Editor: ~Jeff_Phillips1
Submission Number: 5882