Abstract: Earth observation (EO) spans a broad spectrum of modalities, including optical, radar, multispectral, and hyperspectral data, each capturing distinct environmental signals. However, current vision-language models in EO, particularly CLIP-based variants, remain confined to individual modalities, limiting generalization and scalability across diverse tasks. We present DOFA-CLIP (Dynamic-One-For-All CLIP), a unified vision-language foundation model that dynamically adapts to EO modalities with flexible spectral configurations through a single Transformer backbone. Our approach introduces three key contributions: 1) GeoLangBind-2M, a large-scale EO image-text dataset covering six heterogeneous modalities with rich natural-language descriptions; 2) VECT (Vision-models Enhanced Contrastive Text-image pretraining), a novel training strategy that enhances the spatial awareness of CLIP features using multiple vision foundation models; and 3) a Modality-aware Knowledge Agglomeration (MaKA) module that refines feature distillation with modality-specific awareness. DOFA-CLIP achieves state-of-the-art zero-shot performance across a wide range of EO benchmarks, including unseen modalities and varying numbers of input spectral bands. Together, these contributions establish a scalable foundation for multimodal EO understanding and open new avenues for integrating heterogeneous EO data with large language models. Code and datasets are publicly available.
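The abstract describes VECT as contrastive text-image pretraining augmented by feature distillation from vision foundation models. The sketch below is only an illustration of that general idea, assuming a PyTorch setup, a standard symmetric InfoNCE contrastive loss, a cosine-similarity distillation term against a frozen teacher, and a weighting factor `lam`; none of these details, names, or hyperparameters are specified in the abstract.

```python
# Minimal sketch (assumed formulation, not the paper's exact objective):
# combine a CLIP-style contrastive loss with a distillation term that aligns
# the image encoder's features to those of a frozen vision foundation model.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def distillation_loss(student_feats, teacher_feats):
    """Cosine-similarity alignment to a frozen vision-model teacher (assumed)."""
    cos = F.cosine_similarity(F.normalize(student_feats, dim=-1),
                              F.normalize(teacher_feats, dim=-1), dim=-1)
    return (1.0 - cos).mean()

def vect_style_loss(img_emb, txt_emb, student_feats, teacher_feats, lam=0.5):
    """Hypothetical combined objective: contrastive + vision-model distillation."""
    return (clip_contrastive_loss(img_emb, txt_emb)
            + lam * distillation_loss(student_feats, teacher_feats))
```

In this sketch, `teacher_feats` would come from one of the frozen vision foundation models mentioned in the abstract; how multiple teachers and modality-specific weighting (MaKA) are actually combined is not described here and is left out of the example.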