Keywords: Geo-Foundation Model, Spatially Explicit Contrastive Learning, Implicit Neural Representation
Abstract: Vision Transformer (ViT) has been used in many computer vision tasks with excellent results by providing representations for a whole image or image patches. However, ViT lacks detailed localized image representations when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing (RS) data, ground-level imagery, and geospatial vector data. Here high-resolution localized representations are vital for modeling geospatial relationships and alignments across modalities. We proposed to solve this representation problem with an implicit neural representation (INR) module extending ViT with Neural Implicit Local Interpolation, which produces a continuous RS image representation covering arbitrary location in the RS image. Based on the INR module, we propose GAIR, a multimodal Geo-Foundation Model (GeoFM) integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. GAIR utilizes three factorized neural encoders to project different modalities into the embedding space, and the INR module is used to further align these representations geographically, which are trained with contrastive learning objectives from unlabeled data. We evaluate GAIR across 9 geospatial tasks and 22 datasets spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art GeoFMs and alternative training objectives (e.g., MoCo-V2 and MAE) that do not use fine-grained geo-aligned spatial representations. Our results highlight the effectiveness of GAIR in learning generalizable geospatial representations across tasks, spatial scales, and temporal contexts.
Supplementary Material:  zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 12798
Loading