Keywords: Earth Observation, Hyperspectral Optical Imagery, Foundation Model
TL;DR: We propose a novel architecture that efficiently handles spatial and spectral attention for hyperspectral geospatial data.
Abstract: Geospatial raster (imagery) data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. Recent work has adapted existing self-supervised learning approaches to such geospatial data. However, these approaches lack scalable model architectures, leading to inflexibility and computational inefficiency as the number of channels and modalities grows. To address these limitations, we introduce the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) architecture. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder paradigm and evaluate it on GFM-Bench, a comprehensive benchmark that we construct for such geospatial raster data. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
Submission Number: 28