A Scalable Framework for Heterogeneous Environmental Data Management Using Smart Data Pipeline

Published: 10 Jul 2025, Last Modified: 06 May 2026 | Practice and Experience in Advanced Research Computing (PEARC25) | Everyone | CC BY 4.0
Abstract: Environmental data originates from diverse sources, posing challenges in management, processing, and visualization. This paper introduces a scalable, AI-driven data pipeline framework for environmental data management and discovery. The framework integrates workflow orchestration, automated data ingestion and processing, federated storage, and seamless geospatial visualization. It employs a Ceph-based storage system to handle large, heterogeneous datasets, leveraging its fault-tolerant, distributed architecture for high-performance storage across object, block, and file interfaces. To enhance data discoverability and interoperability, the framework incorporates Generative AI (GenAI) for automated metadata generation, reducing manual annotation overhead while improving real-time processing and cross-platform integration. Additionally, the system enables interdisciplinary collaboration through standardized metadata structures and scalable data federation. A case study using buoy data validates the framework’s capabilities, including data processing, cleaning, and visualization. By addressing critical data integration and accessibility challenges, the system fosters a scalable, efficient, and intelligent research data-sharing ecosystem for environmental science studies.