Understanding and Mitigating Memorization in Diffusion Models for Tabular Data

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Tabular data generation has attracted significant research interest in recent years, with tabular diffusion models greatly improving the quality of synthetic data. However, while memorization—where models inadvertently replicate exact or near-identical training data—has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization in diffusion models for tabular data. Our empirical analysis reveals that memorization does occur in tabular diffusion models and grows more severe as training proceeds. We further examine how factors such as dataset size, feature dimensionality, and the choice of diffusion model influence memorization, and we provide a theoretical explanation for why memorization arises in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between randomly paired same-class training samples. Building on this, we introduce TabCutMixPlus, an enhanced method that clusters features by correlation and exchanges features within the same cluster together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation by preserving feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation. Our code is available at https://github.com/fangzy96/TabCutMix.
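To make the TabCutMix operation described in the abstract concrete, the sketch below swaps a randomly selected subset of feature columns between pairs of training rows that share a class label. This is a minimal NumPy sketch under our own assumptions (the function name, the per-pair swap fraction, and the batch-wise loop are illustrative); the authors' reference implementation is in the linked repository.

    # Minimal sketch of the TabCutMix idea: for each training row, swap a
    # random subset of feature columns with another row of the same class.
    # Names and the swap-fraction choice are illustrative assumptions.
    import numpy as np

    def tabcutmix(X, y, lam=None, rng=None):
        """Return an augmented copy of X where each row has a random
        feature subset exchanged with a same-class partner row."""
        rng = np.random.default_rng() if rng is None else rng
        X_aug = X.copy()
        for i in range(len(X)):
            # pick a random partner with the same class label
            same = np.flatnonzero(y == y[i])
            j = rng.choice(same)
            # fraction of features to exchange (drawn per pair if not fixed)
            frac = rng.uniform(0.0, 1.0) if lam is None else lam
            n_swap = int(round(frac * X.shape[1]))
            cols = rng.choice(X.shape[1], size=n_swap, replace=False)
            X_aug[i, cols] = X[j, cols]
        return X_aug, y

    # Usage sketch: X_train is an (n_samples, n_features) array, y_train the labels.
    # X_aug, y_aug = tabcutmix(X_train, y_train)

Because the partner row shares the class label, the augmented sample keeps its original label, mirroring how CutMix-style augmentation is typically applied.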
Lay Summary: Generating synthetic tabular data, such as medical records or financial logs, is important when real data cannot be shared due to privacy concerns. Diffusion models have recently achieved strong results in generating realistic tabular data. However, while memorization, where models unintentionally copy training data, has been studied carefully in image and text generation, it remains largely unexamined for tabular data. We present the first comprehensive study of memorization in tabular diffusion models. We find that memorization does occur and becomes more severe with longer training. We also analyze how factors such as dataset size, number of features, and model design affect memorization. To address this issue, we propose TabCutMix, a method that exchanges subsets of features between samples of the same class. We further introduce TabCutMixPlus, which groups related features and exchanges them together to better preserve the structure of the data (sketched below). Our findings reveal a hidden privacy risk in tabular data generation, and the proposed methods reduce memorization while keeping the generated data useful. This makes diffusion-based tabular data generation safer and more suitable for applications in healthcare, finance, and other sensitive domains.
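The TabCutMixPlus variant can be sketched in the same style. The paper specifies clustering features by their correlations and swapping whole clusters at once; the particular clustering procedure below (average-linkage hierarchical clustering on one minus absolute correlation, with an arbitrary distance threshold) is an assumption made for illustration, not the paper's exact recipe.

    # Illustrative sketch of TabCutMixPlus: group correlated features into
    # clusters, then always exchange an entire cluster as a unit so that
    # related features stay coherent. The clustering method and threshold
    # here are assumptions for illustration only.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def cluster_features(X, threshold=0.5):
        """Assign each feature column a cluster id based on |correlation|."""
        corr = np.corrcoef(X, rowvar=False)
        dist = 1.0 - np.abs(corr)          # highly correlated -> small distance
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="average")
        return fcluster(Z, t=threshold, criterion="distance")

    def tabcutmix_plus(X, y, clusters, rng=None):
        """Like tabcutmix, but swaps operate on whole feature clusters."""
        rng = np.random.default_rng() if rng is None else rng
        X_aug = X.copy()
        ids = np.unique(clusters)
        for i in range(len(X)):
            j = rng.choice(np.flatnonzero(y == y[i]))
            # choose a random number of clusters and move each as a unit
            k = rng.integers(1, len(ids) + 1)
            chosen = rng.choice(ids, size=k, replace=False)
            cols = np.isin(clusters, chosen)
            X_aug[i, cols] = X[j, cols]
        return X_aug, y

    # Usage sketch:
    # clusters = cluster_features(X_train)
    # X_aug, y_aug = tabcutmix_plus(X_train, y_train, clusters)

Swapping clusters rather than individual columns is what keeps strongly correlated features (for example, height and weight) moving together, which is how the method avoids producing out-of-distribution feature combinations.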
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Memorization, Tabular Data, Diffusion Models
Submission Number: 8588