Keywords: synthetic data, generative learning, conditional diffusion model.
Abstract: Synthetic data generated by foundation models has recently emerged as a promising resource for acquiring pre-trained knowledge and improving data efficiency, especially when real data is limited. However, directly incorporating synthetic data often introduces distributional bias and can even lead to model collapse. To address these challenges, we propose Conditional Augmentation with Synthetic Data (CASD), a framework guided by three core principles: (a) No harm and no bias: the use of synthetic data should neither degrade model performance nor introduce distributional bias; (b) Positive utility: synthetic data should improve model performance; (c) Broad adaptability: the approach should apply to synthetic data from diverse sources without case-specific modifications. CASD trains on both real and synthetic data while conditioning on their source labels, treating the two as related but distinct domains. This design enables the model to harness large-scale synthetic data to strengthen representation learning, while mitigating bias by confining the discrepancy between sources to the source-conditional distribution. During sampling, the model benefits from the representation enhanced by synthetic data during training, but fixes the source label to the real domain, ensuring consistency with the target distribution. Experimental results demonstrate that CASD can effectively utilize synthetic data from various foundation models, consistently improving both the quality and diversity of generated images without inheriting distributional bias.
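To make the mechanism concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the idea described in the abstract: a single diffusion denoiser is trained jointly on real and synthetic data, conditioned on a source label, and sampling is run with the source label fixed to the real domain. All names (`SourceConditionalDenoiser`, `train_step`, `sample`), the toy 2-D data, and the hyperparameters are assumptions made purely for illustration.

```python
# Illustrative sketch only; hyperparameters and data are placeholders, not CASD's actual setup.
import torch
import torch.nn as nn

T = 100                                        # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class SourceConditionalDenoiser(nn.Module):
    """Predicts noise given x_t, the timestep, and a source label (0 = real, 1 = synthetic)."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.t_emb = nn.Embedding(T, hidden)
        self.s_emb = nn.Embedding(2, hidden)   # source-label embedding: the key conditioning signal
        self.net = nn.Sequential(
            nn.Linear(dim + 2 * hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, s):
        h = torch.cat([x_t, self.t_emb(t), self.s_emb(s)], dim=-1)
        return self.net(h)

def train_step(model, opt, x_real, x_syn):
    # Train on real and synthetic batches together, each tagged with its source label.
    x0 = torch.cat([x_real, x_syn], dim=0)
    s = torch.cat([torch.zeros(len(x_real), dtype=torch.long),
                   torch.ones(len(x_syn), dtype=torch.long)], dim=0)
    t = torch.randint(0, T, (len(x0),))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    loss = ((model(x_t, t, s) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample(model, n=16, dim=2):
    # Ancestral DDPM sampling with the source label fixed to the real domain (0),
    # so generated samples follow the real-data conditional distribution.
    x = torch.randn(n, dim)
    s = torch.zeros(n, dtype=torch.long)
    for t in reversed(range(T)):
        t_batch = torch.full((n,), t, dtype=torch.long)
        eps = model(x, t_batch, s)
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

model = SourceConditionalDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_real = torch.randn(64, 2)                    # placeholder "real" data (small)
x_syn = torch.randn(256, 2) + 0.5              # placeholder "synthetic" data with a distribution shift
for _ in range(200):
    train_step(model, opt, x_real, x_syn)
samples = sample(model)                        # samples conditioned on the real domain
```

The sketch reflects the design choice stated in the abstract: the synthetic domain contributes gradient signal to the shared denoiser during training, while inference never conditions on it, so its distributional bias is not inherited by the generated samples.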
Primary Area: generative models
Submission Number: 11273