Abstract: Environmental data originates from diverse sources, posing
challenges in management, processing, and visualization. This paper
introduces a scalable, AI-driven data pipeline framework for
environmental data management and discovery. The framework integrates
workflow orchestration, automated data ingestion and processing,
federated storage, and seamless geospatial visualization. It employs
a Ceph-based storage system to handle large, heterogeneous
datasets, leveraging its fault-tolerant, distributed architecture for
high-performance storage across object, block, and file interfaces.
To enhance data discoverability and interoperability, the framework
incorporates Generative AI (GenAI) for automated metadata
generation, reducing manual annotation overhead while improving
real-time processing and cross-platform integration. Additionally,
the system enables interdisciplinary collaboration through standardized
metadata structures and scalable data federation. A case
study using buoy data validates the framework’s capabilities,
including data processing, cleaning, and visualization. By addressing
critical data integration and accessibility challenges, the system
fosters a scalable, efficient, and intelligent research data-sharing
ecosystem for environmental science studies.