Reliable or Risky? Assessing Diffusion Models for Biomedical Data Generation

Abdalrahman Alblwi; Qile Wang; Norbert Zolek; Matthew Louis Mauriello; Kenneth Barner

Published: 12 Oct 2025, Last Modified: 13 Oct 2025

Venue: GenAI4Health 2025 Poster

License: CC BY 4.0

Keywords: Generative Models, LLM, Denoising Diffusion, Synthetic Ultrasound Images, Inter-Rater Reliability

TL;DR: We evaluate the use of diffusion models to generate synthetic breast ultrasound images and tumor masks, highlighting their potential for data augmentation and the need for expert validation to ensure clinical reliability.

Abstract: Biomedical image datasets are often scarce, expensive to annotate, and variable in quality due to differences in imaging hardware. Generative models, particularly diffusion models, have recently demonstrated strong potential to synthesize realistic medical images, offering a promising strategy for data augmentation. Yet their application in clinical contexts requires careful validation, as trust, interpretability, and reliability are essential when medical decisions are at stake. This work introduces a human-in-the-loop framework to assess the reliability and risks of diffusion models for generating breast cancer ultrasound images. Using a Denoising Diffusion Probabilistic Model (D-DDPM), we jointly generate ultrasound images and corresponding tumor masks from two benchmark datasets (BUS-BRA and UDIAT). The evaluation pipeline integrates quantitative image quality metrics (FID, IS, KID), radiologist interpretation, inter-rater agreement (Cohen’s/Fleiss’ Kappa, Krippendorff’s Alpha), and alignment with large language model (LLM) outputs. Results show that while D-DDPM can produce images that are visually similar to real data and that sometimes yield higher agreement among experts than the original images, inter-rater reliability remains weak, particularly for malignant tumors. Radiologists consistently outperform LLMs in classification, and majority voting across experts improves diagnostic accuracy. These findings highlight both the promise and the risk of diffusion models in medical imaging: synthetic ultrasound data can supplement limited datasets, but robust expert validation remains indispensable to ensure clinical trustworthiness and safe integration.
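The agreement statistics and majority voting named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `"benign"`/`"malignant"` labels, and the toy ratings are all hypothetical, and the formulas are the standard textbook definitions of Cohen's and Fleiss' kappa.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty item list")
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds one label per rater for item i.

    Assumes every item was rated by the same number of raters (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings) or n_raters < 2:
        raise ValueError("each item needs the same number (>= 2) of ratings")
    counts = [Counter(row) for row in ratings]
    labels = {label for row in ratings for label in row}
    # Per-item agreement, averaged over items.
    per_item = [
        (sum(c[k] ** 2 for k in labels) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(per_item) / n_items
    # Chance agreement from the pooled label proportions.
    total = n_items * n_raters
    p_e = sum((sum(c[k] for c in counts) / total) ** 2 for k in labels)
    return 1.0 if p_e == 1.0 else (p_bar - p_e) / (1.0 - p_e)


def majority_vote(row):
    """Most frequent label in a row; ties go to the label seen first."""
    return Counter(row).most_common(1)[0][0]


# Hypothetical toy data: four images, three raters each (illustration only).
toy_ratings = [
    ["benign", "benign", "benign"],
    ["malignant", "malignant", "malignant"],
    ["benign", "benign", "malignant"],
    ["malignant", "malignant", "benign"],
]
consensus = [majority_vote(row) for row in toy_ratings]
pairwise = cohen_kappa(
    [row[0] for row in toy_ratings], [row[2] for row in toy_ratings]
)
overall = fleiss_kappa(toy_ratings)
```

With an odd number of raters and two classes, `majority_vote` never hits a tie; for other configurations a tie-breaking rule would need to be chosen explicitly.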

Submission Number: 85
