Abstract: Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to corrupt the data with noise before training, so that the model trains exclusively on noisy samples and never observes the original, potentially copyrighted content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the number of samples required makes successful learning practically unattainable. To overcome this limitation, we propose pretraining the model on a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward–Backward Deconvolution (SFBD) method, this attains an FID of 6.31 on CIFAR-10 with just 4% clean images (and 3.58 with 10%). We also provide theoretical guarantees that SFBD learns the true data distribution. These results underscore the value of pretraining on a small amount of clean data, or on similar datasets. Empirical studies further validate and enrich our findings.
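For intuition, here is the standard deconvolution setup the abstract alludes to, as a sketch in our own notation (not necessarily the paper's):

```latex
% Noisy samples are modeled as clean data plus Gaussian noise:
%   Y = X + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),
% so the observed density is the data density convolved with a Gaussian:
\[
  p_Y = p_X * \mathcal{N}(0, \sigma^2 I).
\]
% Recovering p_X from samples of p_Y is a Gaussian deconvolution problem;
% because Gaussian noise is "supersmooth", classical nonparametric estimation
% rates for this problem decay only logarithmically in the sample size,
% which is the sample-complexity barrier the abstract refers to.
```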
Lay Summary: Modern image generation models, such as those behind AI art tools, are typically trained on massive collections of images. This practice raises important concerns: some of the training data may be copyrighted, and models risk memorizing and reproducing such content too closely. One proposed solution is to train models only on noisy (blurred or altered) versions of the images, so the originals are never directly seen. In practice, however, we show that learning from noisy data alone is extremely difficult: it requires an impractically large number of samples to succeed.
In this work, we focus on diffusion models and show that introducing even a small fraction of clean (original) data, just 4% or 10%, makes a substantial difference. We propose a method called Stochastic Forward–Backward Deconvolution (SFBD), which alternates between denoising the noisy samples with the current model and retraining the model on the denoised results. This process lets the model gradually learn to generate realistic images even when most of the training data is noisy. Our experiments show that SFBD achieves image quality close to that of models trained on fully clean datasets, while greatly reducing legal and ethical risks. This work offers a promising path toward training generative models more responsibly and efficiently.
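To make the alternation concrete, here is a minimal sketch under our own assumptions; the names (`denoise`, `train_denoiser`, `sfbd`) and the `model(x, sigma)` signature are illustrative, not the API of the linked repository:

```python
import torch

def denoise(model, noisy, sigma):
    # Backward step (illustrative): use the current model as a denoiser.
    # We assume `model(x, sigma)` predicts E[clean | noisy] at noise level
    # sigma, e.g., via Tweedie's formula from a learned score.
    with torch.no_grad():
        return model(noisy, sigma)

def train_denoiser(model, data, sigma, steps, lr=1e-4):
    # Forward step (illustrative): retrain with a standard denoising
    # objective, a stand-in for full diffusion training over noise levels.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        eps = torch.randn_like(data)
        pred = model(data + sigma * eps, sigma)
        loss = ((pred - data) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def sfbd(model, noisy_data, clean_data, sigma, rounds=5, steps=1000):
    # Pretrain on the small clean fraction (e.g., 4-10% of the data)
    # to give the deconvolution iterations a good starting point.
    train_denoiser(model, clean_data, sigma, steps)
    for _ in range(rounds):
        denoised = denoise(model, noisy_data, sigma)   # denoise with current model
        train_denoiser(model, denoised, sigma, steps)  # retrain on denoised data
    return model
```

In this sketch, each round re-denoises the original noisy set with the improved model, so the denoised targets sharpen over the iterations rather than compounding errors from earlier rounds.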
Link To Code: https://github.com/watml/SFBD
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: diffusion models, deconvolution, ambient diffusion
Submission Number: 2932