Abstract: While autoregressive models dominate natural language generation, their application to tabular data remains limited due to two challenges: 1) tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data, and 2) tabular data is column permutation-invariant, requiring flexible generation orders. Traditional autoregressive models, with their fixed generation order, struggle with tasks like missing data imputation, where the target and conditioning columns vary. To address these issues, we propose Diffusion-nested Non-autoregressive Transformer (TabNAT), a hybrid model combining diffusion processes and masked generative modeling. For continuous columns, TabNAT uses a diffusion model to parameterize their conditional distributions, while for discrete columns, it employs next-token prediction with KL divergence minimization. A masked Transformer with bi-directional attention enables order-agnostic generation, allowing it to learn the distribution of target columns conditioned on arbitrary observed columns. Extensive experiments on ten datasets with diverse properties demonstrate TabNAT's superiority in both unconditional tabular data generation and conditional missing data imputation tasks.
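The order-agnostic masked generation described above can be sketched in miniature: during training, a random subset of columns is masked so the model learns every (observed, target) split, and at inference the masked columns are filled one at a time in an arbitrary order, each conditioned on everything unmasked so far. The sketch below is illustrative only, assuming a hypothetical `toy_predict` stand-in for the trained masked Transformer; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(d, rng):
    # Sample a random number of columns to mask, then a random subset,
    # so every (observed, target) column split is seen during training.
    k = rng.integers(1, d + 1)
    idx = rng.permutation(d)[:k]
    mask = np.zeros(d, dtype=bool)
    mask[idx] = True
    return mask

def generate(predict_fn, x_obs, mask, rng):
    # Order-agnostic generation: visit the masked columns in a random
    # order, filling each one conditioned on all values unmasked so far.
    x = x_obs.copy()
    remaining = list(np.flatnonzero(mask))
    rng.shuffle(remaining)
    for j in remaining:
        x[j] = predict_fn(x, j)
    return x

# Hypothetical stand-in for the trained network: imputes a masked column
# from the currently observed values (here, simply their mean).
def toy_predict(x, j):
    obs = x[~np.isnan(x)]
    return obs.mean() if obs.size else 0.0

d = 4
mask = random_mask(d, rng)
row = np.array([1.0, 2.0, 3.0, 4.0])
x_obs = np.where(mask, np.nan, row)     # NaN marks the masked columns
filled = generate(toy_predict, x_obs, mask, rng)
assert not np.isnan(filled).any()       # every masked column was imputed
```

Because the fill order is randomized rather than fixed, the same trained predictor serves both unconditional generation (mask everything) and imputation (mask only the missing columns).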
Lay Summary: AI models that are excellent at writing text often struggle to create or complete spreadsheet-style data because tables contain a mix of different data types, like numbers and text. These models also work in a fixed, step-by-step order, which is not ideal for tables where the column order can change. To address this, we developed a flexible new model called TabNAT. This hybrid system uses a special technique for generating numbers and a different one for categories, allowing it to handle the mixed data effectively. Because it doesn't rely on a fixed order, TabNAT can generate data for any column based on the others, making it perfect for filling in missing values. In extensive tests on various datasets, TabNAT proved to be significantly better at both creating new, realistic tables and completing existing ones than previous methods.
Link To Code: https://github.com/fangliancheng/TabNAT
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Tabular data generation, missing value imputation, autoregressive models
Submission Number: 10647