Using Synthetic Data for Data Augmentation to Improve Classification Accuracy

Published: 23 Jun 2023, Last Modified: 08 Jul 2023
Venue: DeployableGenerativeAI
Keywords: diffusion model, data augmentation, synthetic data, image classification, generative model, model inversion
TL;DR: We propose a new method to steer the Stable Diffusion model to generate data for downstream classifier training, achieving better performance than training a classifier on the real data alone.
Abstract: Obtaining high-quality data for training classification models is challenging when sufficient data covering the real manifold is difficult to find in the wild. In this paper, we present Diffusion Inversion, a dataset-agnostic augmentation strategy for training classification models. Diffusion Inversion is a simple yet effective method that leverages the powerful pretrained Stable Diffusion model to generate synthetic datasets that ensure coverage of the original data manifold while also generating novel samples that extrapolate beyond the training domain, allowing for better generalization. We ensure data coverage by inverting each image in the original set to its condition vector in the latent space of Stable Diffusion. We ensure sample diversity by adding noise to the learned embeddings or performing interpolation in the latent space, and using the new vector as the conditioning signal. The method produces high-quality and diverse samples, consistently outperforming generic prompt-based steering methods and KNN retrieval baselines across a wide range of common and specialized datasets. Furthermore, we demonstrate the compatibility of our approach with widely used data augmentation techniques, and assess the reliability of the generated data in both supporting various neural architectures and enhancing few-shot learning performance.
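The diversity step described in the abstract — perturbing learned condition vectors with noise or interpolating between them before using them as conditioning signals — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the embedding array stands in for condition vectors that the method would obtain by inverting real images into Stable Diffusion's latent conditioning space, and the function names and noise scale are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for learned condition vectors: one per training image.
# In the paper's method these come from inverting each image into the
# latent conditioning space of Stable Diffusion; here they are random.
num_images, embed_dim = 4, 8
learned_embeddings = rng.normal(size=(num_images, embed_dim))

def perturb(embedding, sigma=0.1):
    """Add Gaussian noise to a learned embedding, yielding a novel
    conditioning vector near the original data point."""
    return embedding + sigma * rng.normal(size=embedding.shape)

def interpolate(e_a, e_b, alpha=0.5):
    """Linearly interpolate between two learned embeddings to obtain
    conditioning vectors lying between training images."""
    return (1.0 - alpha) * e_a + alpha * e_b

# New conditioning signals that would be fed to the generator
# to produce synthetic training samples:
noisy = perturb(learned_embeddings[0])
mixed = interpolate(learned_embeddings[0], learned_embeddings[1], alpha=0.3)
```

Each perturbed or interpolated vector would then condition the frozen Stable Diffusion model to synthesize an additional labeled sample for classifier training.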
Submission Number: 4