Limited Reference, Reliable Generation: Rule-Guided Tabular Data Generation with Dual-Granularity Filtering
Keywords: Tabular Data Generation, Low-Data Regimes
Abstract: Synthetic tabular data generation is increasingly essential in machine learning, supporting downstream applications when high-quality real-world tabular data is scarce.
Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require substantial reference data, limiting their effectiveness on domain-specific datasets with few records.
While prompt-based LLMs offer flexibility without parameter tuning, they often generate data that drifts from the reference distribution and exhibits localized redundancy, degrading downstream task performance.
To overcome these issues, we propose \textit{\textbf{ReFine}}, a framework that (i) derives symbolic \emph{if–then} rules from interpretable models and embeds them into prompts to explicitly guide generation toward the domain-specific distribution, and (ii) applies dual-granularity filtering that suppresses over-sampled patterns and selectively refines rare but informative samples, reducing localized redundancy.
Extensive experiments on various regression and classification benchmarks demonstrate that \textit{ReFine} consistently outperforms state-of-the-art methods, achieving up to \textbf{0.36} absolute improvement in $R^2$ for regression and \textbf{7.50\%} relative improvement in $F_1$ for classification tasks.
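To make the rule-guided step concrete, below is a minimal, hypothetical sketch of how symbolic if–then rules might be extracted from an interpretable model (here, a toy decision tree hardcoded as nested dicts) and embedded into a generation prompt. All names, the tree structure, and the prompt wording are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of rule-guided prompting in the style of ReFine.
# The tree, feature names, and prompt template are illustrative only.

# A toy interpretable model: a depth-2 decision tree over two features.
TREE = {
    "feature": "age", "threshold": 40,
    "left":  {"leaf": "income<=50K"},
    "right": {"feature": "hours_per_week", "threshold": 45,
              "left":  {"leaf": "income<=50K"},
              "right": {"leaf": "income>50K"}},
}

def extract_rules(node, conditions=()):
    """Walk the tree and emit one symbolic if-then rule per leaf."""
    if "leaf" in node:
        cond = " AND ".join(conditions) or "TRUE"
        return [f"IF {cond} THEN {node['leaf']}"]
    f, t = node["feature"], node["threshold"]
    return (extract_rules(node["left"], conditions + (f"{f} <= {t}",))
            + extract_rules(node["right"], conditions + (f"{f} > {t}",)))

def build_prompt(rules, n_samples=5):
    """Embed the extracted rules in a prompt that constrains generation."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    return (f"Generate {n_samples} synthetic tabular records that are "
            f"consistent with ALL of these domain rules:\n{rule_block}\n"
            "Return one record per line as comma-separated values.")

rules = extract_rules(TREE)
print(build_prompt(rules))
```

In this sketch, each root-to-leaf path becomes one rule, so the prompt exposes the interpretable model's decision boundaries to the LLM; the generated records can then be checked against the same rules before the filtering stage.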
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17437