A Closer Look at Memorization in Tabular Diffusion Models: A Data-Centric Perspective

TMLR Paper 6485 Authors

13 Nov 2025 (modified: 18 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Diffusion models have shown strong performance in generating high-quality tabular data, but they pose privacy risks by inadvertently reproducing exact training samples. While prior work focuses on data augmentation to mitigate memorization, little is known about which individual samples contribute most to it. In this paper, we present the first data-centric study of memorization dynamics in tabular diffusion models. We begin by quantifying memorization for each real sample as the number of generated samples flagged as its memorized replicas under a relative distance ratio metric. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, a finding further validated through sample-removal experiments. To better understand this effect, we divide real samples into top-memorized and non-top-memorized groups and analyze the differences in their training-time behavior. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC) across groups. We find that top-memorized samples tend to be memorized slightly earlier and show significantly stronger memorization signals in early training stages. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method. DynamicCut (a) ranks real samples by their epoch-wise memorization intensity, (b) prunes a tunable top fraction, and (c) retrains the model on the filtered dataset. Across multiple benchmark tabular datasets and tabular diffusion models, DynamicCut reduces memorization ratios with negligible impact on data diversity and downstream task performance, and it complements existing data augmentation methods for further memorization mitigation. Furthermore, the sample tagging produced by DynamicCut transfers across generative models: high-ranked samples identified with one model (e.g., a diffusion model) also reduce memorization when removed from the training data of other generative models such as GANs and VAEs.
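To make the abstract's two key steps concrete, below is a minimal Python sketch of (i) counting per-real-sample memorization with a relative distance ratio and (ii) the rank-and-prune step of DynamicCut. The specific ratio formulation (nearest over second-nearest real-neighbor distance), the threshold `tau`, the AUC proxy, and all function names are illustrative assumptions, not the paper's exact definitions.

```python
# Hedged sketch, not the authors' implementation: flag generated samples as
# memorized replicas via a relative distance ratio, count replicas per real
# sample, and prune the top-ranked samples by epoch-wise memorization intensity.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def memorization_counts(real: np.ndarray, generated: np.ndarray, tau: float = 1 / 3):
    """For each generated row, find its two nearest real rows. If the ratio
    d1/d2 (nearest over second-nearest distance) falls below tau, flag the row
    as a memorized replica of its nearest real sample. Returns per-real counts.
    The ratio form and tau are assumptions for illustration."""
    nn = NearestNeighbors(n_neighbors=2).fit(real)
    dists, idx = nn.kneighbors(generated)                  # shape: (n_gen, 2)
    ratios = dists[:, 0] / np.maximum(dists[:, 1], 1e-12)  # guard against zero
    counts = np.zeros(len(real), dtype=int)
    for flagged, nearest in zip(ratios < tau, idx[:, 0]):
        if flagged:
            counts[nearest] += 1                           # credit the replicated real sample
    return counts


def dynamiccut_prune(intensity_per_epoch: np.ndarray, prune_frac: float = 0.05):
    """intensity_per_epoch: array of shape (n_real, n_epochs) holding a per-epoch
    memorization signal for each real sample. Rank samples by the area under
    their epoch-wise intensity curve (here a simple sum as an AUC proxy) and
    return the indices kept after pruning the top prune_frac fraction."""
    auc = intensity_per_epoch.sum(axis=1)
    n_prune = int(prune_frac * len(auc))
    pruned = np.argsort(auc)[::-1][:n_prune]               # highest-intensity samples
    return np.setdiff1d(np.arange(len(auc)), pruned)       # indices to retrain on
```

In this reading, the retained indices from `dynamiccut_prune` define the filtered dataset on which the generative model is retrained in the method's second stage.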
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sanghamitra_Dutta2
Submission Number: 6485