End-to-End Process Orchestration of Earth Observation Data Workflows with Apache Airflow on High Performance Computing
Abstract: Earth Observation (EO) data processing faces challenges due to large data volumes, multiple sources, and diverse formats. To address these challenges, this paper presents a scalable and parallelizable workflow built on Apache Airflow that can integrate Machine Learning (ML) and Deep Learning (DL) models with Modular Supercomputing Architecture (MSA) systems. As a case study, we test the workflow on the production of large-scale Land-Cover (LC) maps. Airflow, the workflow manager, offers scalability, extensibility, and programmable task definition in Python, and allows us to execute different steps of the workflow on different High-Performance Computing (HPC) systems. The workflow is demonstrated on the Dynamical Exascale Entry Platform (DEEP) and the Jülich Research on Exascale Cluster Architectures (JURECA) system, both hosted at the Jülich Supercomputing Centre (JSC), which together provide a platform of heterogeneous JSC systems.