Abstract: Self-supervised learning (SSL) has gained significant traction in the remote sensing community, where pretraining a foundation model on large-scale unlabeled datasets for the interpretation of remote sensing images (RSIs) has emerged as a trending direction. This approach aims to supplant the conventional practice of loading ImageNet-pretrained weights, offering a more versatile and potentially more effective solution for handling RSIs. Among SSL techniques, contrastive learning excels at extracting general representations in the field of remote sensing. However, its excessive focus on inter-instance discrimination hinders the effectiveness of pretraining, given the diverse and complex geographical information present in RSIs. Moreover, the typical two-variations-as-one-pair pattern may be suboptimal, particularly in light of the temporal information specific to RSIs. In this article, we propose a novel method, promoting intra-instance similarity (PIS for short), which leverages the temporal information specific to RSIs and increases intra-instance variations to expand the positive representation space. By promoting intra-instance similarity within this space, our foundation models learn to extract more general, instance-invariant features that benefit various downstream tasks. Experiments show that PIS achieves state-of-the-art (SOTA) performance on ten datasets across four downstream remote sensing tasks, demonstrating the generalizability and efficacy of the proposed method. Our preliminary investigation into intra-instance characteristics suggests substantial potential in this direction, holding considerable promise for further exploration. The code is available at https://github.com/ShawnAn-WHU/PIS.git.
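To make the idea of "expanding the positive representation space" concrete, the following is a minimal NumPy sketch of a multi-positive contrastive objective in which all temporal views of the same instance are treated as mutual positives (a SupCon-style generalization of the usual two-views-per-pair InfoNCE). This is an illustrative assumption about the general setup, not the paper's exact PIS loss; the function name, shapes, and temperature value are hypothetical.

```python
import numpy as np

def multi_positive_contrastive_loss(views, temperature=0.1):
    """Illustrative multi-positive contrastive loss (not the exact PIS objective).

    views: array of shape (N, K, D) -- N instances, each with K temporal
    views embedded in D dimensions. All K views of an instance are
    treated as positives for each other; views of other instances
    serve as negatives.
    """
    N, K, D = views.shape
    z = views.reshape(N * K, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # L2-normalize embeddings
    sim = z @ z.T / temperature                          # (NK, NK) cosine similarities
    labels = np.repeat(np.arange(N), K)                  # instance id of each view
    mask_pos = labels[:, None] == labels[None, :]        # same-instance pairs
    np.fill_diagonal(mask_pos, False)                    # an anchor is not its own positive
    logits = sim - sim.max(axis=1, keepdims=True)        # shift for numerical stability
    exp = np.exp(logits)
    np.fill_diagonal(exp, 0.0)                           # drop self-similarity from denominator
    log_prob = logits - np.log(exp.sum(axis=1))[:, None]
    # average log-likelihood over every positive of each anchor
    loss = -(log_prob * mask_pos).sum(axis=1) / mask_pos.sum(axis=1)
    return loss.mean()
```

With K = 2 and standard augmentations this reduces to the familiar pairwise InfoNCE; raising K with temporal views enlarges the positive set per anchor, which is the intuition behind promoting intra-instance similarity.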