Limited Reference, Reliable Generation: Rule-Guided Tabular Data Generation with Dual-Granularity Filtering
Keywords: Tabular Data Generation, Low-Data Regimes
Abstract: Synthetic tabular data generation is increasingly essential in machine learning, supporting downstream applications when high-quality real-world tabular data is scarce.
Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require substantial reference data, limiting their effectiveness on domain-specific datasets with few records.
While prompt-based LLMs offer flexibility without parameter tuning, they often generate data that drifts from the reference distribution and exhibits localized redundancy, degrading downstream task performance.
To overcome these issues, we propose \textit{\textbf{ReFine}}, a framework that (i) derives symbolic \emph{if–then} rules from interpretable models and embeds them into prompts to explicitly guide generation toward the domain-specific distribution, and (ii) applies dual-granularity filtering that suppresses over-sampled patterns and selectively refines rare but informative samples, reducing localized redundancy.
Extensive experiments on various regression and classification benchmarks demonstrate that \textit{ReFine} consistently outperforms state-of-the-art methods, achieving up to \textbf{0.36} absolute improvement in $R^2$ for regression and \textbf{7.50\%} relative improvement in $F_1$ for classification tasks.
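To make the rule-guided step concrete, below is a minimal, hypothetical sketch of how symbolic if–then rules might be extracted from an interpretable model (here, a toy decision tree hardcoded as nested dicts) and embedded into a generation prompt. All names, the tree structure, and the prompt wording are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of rule-guided prompting in the style of ReFine.
# The tree, feature names, and prompt template are illustrative only.

# A toy interpretable model: a depth-2 decision tree over two features.
TREE = {
    "feature": "age", "threshold": 40,
    "left":  {"leaf": "income<=50K"},
    "right": {"feature": "hours_per_week", "threshold": 45,
              "left":  {"leaf": "income<=50K"},
              "right": {"leaf": "income>50K"}},
}

def extract_rules(node, conditions=()):
    """Walk the tree and emit one symbolic if-then rule per leaf."""
    if "leaf" in node:
        cond = " AND ".join(conditions) or "TRUE"
        return [f"IF {cond} THEN {node['leaf']}"]
    f, t = node["feature"], node["threshold"]
    return (extract_rules(node["left"], conditions + (f"{f} <= {t}",))
            + extract_rules(node["right"], conditions + (f"{f} > {t}",)))

def build_prompt(rules, n_samples=5):
    """Embed the extracted rules in a prompt that constrains generation."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    return (f"Generate {n_samples} synthetic tabular records that are "
            f"consistent with ALL of these domain rules:\n{rule_block}\n"
            "Return one record per line as comma-separated values.")

rules = extract_rules(TREE)
print(build_prompt(rules))
```

In this sketch, each root-to-leaf path becomes one rule, so the prompt exposes the interpretable model's decision boundaries to the LLM; the generated records can then be checked against the same rules before the filtering stage.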
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17437