Model and Data Management for Machine Learning (M2ML): Integrating Instruments, Edge and HPC for Accelerated Machine Learning

Published: 2024, Last Modified: 25 Jan 2026, IEEE Big Data 2024, CC BY-SA 4.0
Abstract: Using data produced by scientific instruments, such as the Advanced Photon Source Upgrade (APS-U), to train and fine-tune machine learning models is becoming increasingly challenging because of high data production rates, large data volumes, and the growing complexity of the models themselves. To address these challenges, researchers have developed frameworks such as fairDMS, which organize vast amounts of data and models so that they can be queried rapidly when model degradation is detected. However, the complexity of these frameworks and the physically distributed nature of experimental facilities complicate their deployment. Here we introduce M2ML, a high-performance model and data management framework for machine learning. In contrast to previous frameworks, M2ML abstracts its tasks into three key elements that users can easily call and access. M2ML can use a variety of computational resources distributed across scientific facilities to accelerate machine learning tasks. For example, it can automatically transfer data from an experimental facility (such as APS-U) to a high-performance computing (HPC) facility (such as the Argonne Leadership Computing Facility (ALCF)), train machine learning models at the HPC facility, and deploy the trained models on edge computing devices back at the experimental facility for inferencing. M2ML provides a unified interface for (on-the-fly) model (re)training, storage, evaluation, fine-tuning, and inferencing on heterogeneous resources that may be geographically distributed. M2ML builds on Globus services such as Globus Transfer and Globus Compute (formerly FuncX). We evaluate M2ML using a high-energy diffraction microscopy (HEDM) workflow that employs BraggNN to predict diffraction peak locations. Results show that, although the BraggNN model is small, M2ML can significantly accelerate the workflow by selectively assigning tasks to different computing resources.
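The transfer, train, and deploy example in the abstract can be pictured with a minimal sketch. This is not the M2ML API: the `Workflow` class, the `run_on` method, the resource labels, and the toy `transfer`, `train`, and `infer` steps are all hypothetical names chosen for illustration, and every step runs locally rather than through Globus Transfer or Globus Compute. The sketch only shows the idea of assigning each step of a workflow to a different resource and recording that assignment.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Workflow:
    """Records which (hypothetical) resource each step was assigned to."""
    log: List[Tuple[str, str]] = field(default_factory=list)

    def run_on(self, resource: str, step: Callable[..., Any], *args: Any) -> Any:
        # Stand-in for remote execution: the step simply runs locally here.
        self.log.append((resource, step.__name__))
        return step(*args)


def transfer(samples):
    # Placeholder for moving data from the instrument to the HPC facility.
    return list(samples)


def train(samples):
    # Placeholder "model": the sample mean stands in for a trained network.
    return sum(samples) / len(samples)


def infer(model, x):
    # Placeholder inference: deviation of a new measurement from the "model".
    return x - model


def run_pipeline(samples, new_measurement):
    wf = Workflow()
    staged = wf.run_on("hpc-storage", transfer, samples)      # instrument -> HPC
    model = wf.run_on("hpc-compute", train, staged)           # train at HPC
    prediction = wf.run_on("edge-device", infer, model, new_measurement)  # infer at edge
    return prediction, wf.log
```

In a real deployment, the body of `run_on` is where a data-transfer or remote-execution service would be invoked; the sketch keeps it local so the control flow stays easy to follow.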