Keywords: diffusion models, synthetic datasets for machine learning, generative data augmentation
TL;DR: Through extensive analysis, we found that synthetic samples from diffusion models are less informative than real samples for training classifiers.
Abstract: Synthetic samples from diffusion models are promising for training discriminative models, either as replications of or augmentations to real training datasets. However, we found that synthetic datasets degrade classification performance relative to real datasets when compared at the same dataset size. This means that synthetic samples from modern diffusion models are less informative for training discriminative models. This paper investigates the gap between synthetic and real samples by analyzing synthetic samples reconstructed from real samples through the noising (diffusion) and denoising (reverse) processes of diffusion models. By varying the time step at which the reverse process starts, we can control the trade-off between the information retained from the original real data and the information produced by the diffusion model. By assessing the reconstructed samples and the trained models, we found that as the reverse step increases, the synthetic samples concentrate around the modes of the training data distribution and thus struggle to cover its outer edges. In contrast, we found that these synthetic samples yield significant improvements in the data augmentation setting, where both real and synthetic samples are used, indicating that samples around the modes are useful as interpolations for learning classification boundaries. These findings suggest that modern diffusion models are currently insufficient to replicate a real training dataset at the same dataset size, but are suitable for interpolating between real training samples as augmentation datasets.
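To make the reconstruction procedure in the abstract concrete, below is a minimal sketch, not the authors' code: a real sample is noised up to a chosen starting step with the standard DDPM forward process, then denoised back with the learned reverse process. The noise-prediction network `model(x, t)`, the linear beta schedule, and the function name `reconstruct` are all illustrative assumptions.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear DDPM noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products \bar{alpha}_t

@torch.no_grad()
def reconstruct(model, x0, t_start):
    """Noise x0 up to step t_start, then denoise it back with the reverse process.

    A small t_start keeps most of the information in the original real sample;
    a large t_start lets the diffusion model generate more of the content.
    """
    # Forward (noising): x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    eps = torch.randn_like(x0)
    abar = alpha_bars[t_start - 1]
    x = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

    # Reverse (denoising): DDPM ancestral sampling from t_start down to 1
    for t in range(t_start, 0, -1):
        eps_hat = model(x, torch.tensor([t - 1]))   # hypothetical noise predictor
        a, abar_t = alphas[t - 1], alpha_bars[t - 1]
        mean = (x - (1.0 - a) / (1.0 - abar_t).sqrt() * eps_hat) / a.sqrt()
        if t > 1:
            x = mean + betas[t - 1].sqrt() * torch.randn_like(x)
        else:
            x = mean                                # final step is noise-free
    return x
```

Sweeping `t_start` from small to large values traces the trade-off the paper studies: reconstructions close to the real data at one end, and fully model-generated samples at the other.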
Primary Subject Area: Data collection and benchmarking techniques
Paper Type: Research paper: up to 8 pages
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 15