Stable Diffusion for Imbalanced Detection Datasets: a Trustworthy Approach to Generate Guided Synthetic Biomedical Image Samples

Salvatore Capuozzo, Lidia Marassi

Published: 2025, Last Modified: 28 Feb 2026. ICIAP (Workshops 2) 2025. License: CC BY-SA 4.0
Abstract: The emergence of generative models has enabled the creation of synthetic data to augment existing datasets. Although this practice is generally considered safe, it can pose significant risks in high-stakes domains such as biomedicine. Artificial Intelligence (AI) systems trained on synthetic datasets, particularly those generated without rigorous safeguards, can inadvertently contribute to incorrect diagnoses or clinical decisions involving patients and animals. To mitigate such risks, the European Union (EU) has introduced the Ethics Guidelines for Trustworthy AI, emphasizing that trustworthy AI must be lawful, ethical, and robust. Consequently, the datasets used to train such models must also be reliable and well-validated. In this context, we propose a standardized framework for the generation and validation of synthetic datasets for object detection in the biomedical domain. The proposed methodology is structured into two primary pipelines: one dedicated to data generation using Stable Diffusion (SD) and harmonization models, and the other focused on validation through likelihood scoring, detection models, and structured checklists. Experiments conducted on a microscopy-specific use case support the effectiveness of this approach as a reliable solution for augmenting imbalanced datasets in accordance with EU regulatory principles.