M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and Multispectral Data

Published: 26 Sept 2024, Last Modified: 13 Nov 2024NeurIPS 2024 Track Datasets and Benchmarks PosterEveryoneRevisionsBibTeXCC BY-SA 4.0
Keywords: Earth Observation, Synthetic Aperture Radar, SAR, Remote Sensing, Computer Vision, Machine Learning, Dataset
TL;DR: A large-scale, ML-ready earth observation dataset and pipeline including additional SAR datatypes such as coherence and interferometry, in addition to RGB data and labelled tasks.
Abstract: Satellite-based remote sensing has revolutionised the way we address global challenges in a rapidly evolving world. Huge quantities of Earth Observation (EO) data are generated by satellite sensors daily, but processing these large datasets for use in ML pipelines is technically and computationally challenging. Specifically, different types of EO data are often hosted on a variety of platforms, with differing degrees of availability for Python preprocessing tools. In addition, spatial alignment across data sources and data tiling for easier handling can present significant technical hurdles for novice users. While some preprocessed Earth observation datasets exist, their content is often limited to optical or near-optical wavelength data, which is ineffective at night or in adverse weather conditions. Synthetic Aperture Radar (SAR), an active sensing technique based on microwave length radiation, offers a viable alternative. However, the application of machine learning to SAR has been limited due to a lack of ML-ready data and pipelines, particularly for the full diversity of SAR data, including polarimetry, coherence and interferometry. In this work, we introduce M3LEO, a multi-modal, multi-label Earth observation dataset that includes polarimetric, interferometric, and coherence SAR data derived from Sentinel-1, alongside multispectral Sentinel-2 imagery and a suite of auxiliary data describing terrain properties such as land use. M3LEO spans approximately 17M data chips, each measuring 4x4 km, across six diverse geographic regions. The dataset is complemented by a flexible PyTorch Lightning framework, with configuration management using Hydra, to accommodate its use across diverse ML applications in Earth observation. Additionally, we provide tools to process any dataset available on popular platforms such as Google Earth Engine for seamless integration with our framework. We show that the distribution shift in self-supervised embeddings is substantial across geographic regions, even when controlling for terrain properties. Data is available at huggingface.co/M3LEO, and code at github.com/spaceml-org/M3LEO.
Supplementary Material: zip
Submission Number: 1949
Loading