Keywords: materials science, foundation models, synthetic data generation
Abstract: Developments in deep learning have facilitated the automatic visual analysis of scientific data, driving forward exploratory research. However, these approaches depend on large amounts of expert-annotated data for effective training, which is difficult to come by in narrow application domains. In this work, we address the challenges of performing visual analysis of high-speed X-ray phase-contrast images of the combustion of molten metal particles. In this case, manual annotation of thousands of complex frames is highly impractical. To address this, we propose a synthetic data generation framework that eliminates the need for large-scale manual labelling by generating image-annotation pairs for the task of image segmentation. We first train a denoising diffusion model with a small number of annotated samples to generate image-binary mask pairs. We then use the predictions of a fine-tuned segmentation foundation model to create multi-class semantic annotations for the synthetic dataset. We apply our framework to X-ray phase-contrast videos of particle combustion. From 200 manually annotated frames, we generate 10,000 synthetic image-annotation pairs. We demonstrate that training semantic segmentation models with our generated synthetic data yields a significant boost in performance.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 6