Abstract: Self-supervised learning (SSL) through masked autoencoders (MAEs) has recently attracted great attention for remote sensing (RS) foundation model (FM) development, enabling improved representation learning across diverse sensors and downstream tasks. However, existing RS FMs often either suffer from substantial computational complexity during both training and inference or exhibit limited representational capacity. These issues restrict their practical applicability in RS. To address these limitations, we propose an adaptation that enhances the efficiency of RS FMs by integrating the Soft mixture-of-experts (MoE) mechanism into the FM, allowing modality-specific expert specialization alongside shared cross-sensor representation learning. To demonstrate the effectiveness of our adaptation, we apply it to the Cross-Sensor Masked Autoencoder (CSMAE) model, resulting in the Cross-Sensor Mixture-of-Experts (CSMoE) model. In addition, we introduce a thematic-climatic descriptor-driven sampling strategy to construct a representative and diverse training set for our CSMoE model. Extensive experiments on scene classification, semantic segmentation, and content-based image retrieval (CBIR) demonstrate that our adaptation reduces computational requirements while maintaining or improving representational performance. Compared to state-of-the-art RS FMs, CSMoE achieves a superior trade-off between representational capacity, accuracy, and computational efficiency. On average, CSMoE achieves more than twice the computational efficiency of existing RS FMs while maintaining competitive performance across all experiments. These results highlight the effectiveness of the proposed adaptation for creating scalable and computationally efficient RS FMs. The code for the model and the training set creation, as well as the pretrained model weights, will be available at https://git.tu-berlin.de/rsim/csmoe.
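To make the Soft MoE mechanism referenced above concrete, the following is a minimal NumPy sketch of the generic soft dispatch/combine idea behind Soft MoE layers: every token contributes to every expert slot through softmax weights, so no tokens are dropped or hard-routed. This is an illustrative sketch only, not the authors' CSMoE implementation; the function names, the slot-parameter tensor `phi`, and the toy experts are all hypothetical.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(tokens, phi, experts):
    """Generic Soft MoE layer sketch (illustrative, not the CSMoE code).

    tokens : (n, d) input token embeddings
    phi    : (d, e, s) learnable slot parameters (e experts, s slots each)
    experts: list of e callables, each mapping (s, d) -> (s, d)
    """
    n, _ = tokens.shape
    e, s = phi.shape[1], phi.shape[2]
    # Token-slot affinity logits.
    logits = np.einsum('nd,des->nes', tokens, phi)          # (n, e, s)
    # Dispatch: softmax over tokens -> each slot is a convex mix of tokens.
    dispatch = softmax(logits.reshape(n, -1), axis=0).reshape(n, e, s)
    # Combine: softmax over slots -> each token is a convex mix of slot outputs.
    combine = softmax(logits.reshape(n, -1), axis=1).reshape(n, e, s)
    slot_inputs = np.einsum('nes,nd->esd', dispatch, tokens)  # (e, s, d)
    slot_outputs = np.stack([experts[i](slot_inputs[i]) for i in range(e)])
    return np.einsum('nes,esd->nd', combine, slot_outputs)    # (n, d)
```

Because dispatch and combine are dense softmax mixtures rather than top-k routing, the layer stays fully differentiable while each expert processes only a small, fixed number of slots, which is the source of the efficiency gain.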