Keywords: Diffusion Models, Diversity, Data Augmentation.
Abstract: The scarcity of large and well-annotated datasets is a concern in medical image analysis, particularly for emerging applications without substantial public dataset releases. Data synthesis has become relevant to address this problem, as conditional generative models can provide extensive amounts of data. However, the diversity of these synthetic samples can be limited to their training distribution, which restricts the benefits of synthetic data for augmentation. This paper analyses this limitation in the context of medical image classification using two datasets: chest X-ray and strep pharyngitis detection in smartphone photos. Our findings reveal that the performance improvements when augmenting training datasets with generated samples can be inconsistent. Furthermore, in some cases, using a small number of strategically chosen synthetic samples can outperform a larger, randomly selected synthetic set. This highlights the need for effective sampling strategies in conditional diffusion models to improve training diversity and enhance performance in downstream applications.
Submission Number: 94