Group-wise Semantic-enhanced Interaction Network for Remote Sensing Spatio-Temporal Fusion

Published: 2025, Last Modified: 05 Nov 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Remote sensing spatio-temporal fusion (STF) aims to fuse temporally dense coarse-resolution images with temporally sparse fine-resolution images to reconstruct images of high spatial and temporal resolution. Multi-band remote sensing images, whose bands have complementary characteristics for high-fidelity land-surface reconstruction, are commonly used as STF inputs; however, existing STF frameworks often treat the bands uniformly, ignoring the statistical correlations between them and yielding unsatisfactory results. To address this problem, this paper presents a group-wise semantic-enhanced interaction network for STF, dubbed GSINet. Statistical observations show that feature correlation between the visible-light and invisible-light band groups is weak, while intra-group correlation is strong. GSINet therefore first separates the input bands into visible-light and invisible-light groups, which are fed into separate branches with independent encoders for feature extraction and fusion. Then, to handle land cover changes between the coarse-resolution prediction image and the fine-resolution reference image, a Semantic-Enhancement Fusion Module (SEFM) is designed to let features within the same group interact, enriched with semantic information captured in an unsupervised learning manner. The semantic-enhanced fused features from the two groups are then passed to an Interleaved Cross-attention Module (ICM) for further fusion. Finally, the output fusion features, which fully encode the intra- and inter-group information, are fed into the decoder to reconstruct the high spatio-temporal resolution images. Extensive experiments on the CIA and LGC benchmark datasets demonstrate that GSINet outperforms a variety of state-of-the-art methods on multiple metrics.
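The grouping-then-interaction pipeline described in the abstract (split bands into visible and invisible groups, fuse within each group, then exchange information across groups via cross-attention) can be sketched as follows. This is a minimal NumPy illustration under assumed band orderings, toy dimensions, and randomly initialized projections, not the authors' implementation; all function names and shapes here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat, kv_feat, d=16, seed=0):
    # hypothetical single-head cross-attention: queries come from one band
    # group, keys/values from the other, mimicking the inter-group exchange
    # the ICM is said to perform (projection weights are random stand-ins)
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((q_feat.shape[-1], d)) / np.sqrt(q_feat.shape[-1])
    Wk = rng.standard_normal((kv_feat.shape[-1], d)) / np.sqrt(kv_feat.shape[-1])
    Wv = rng.standard_normal((kv_feat.shape[-1], d)) / np.sqrt(kv_feat.shape[-1])
    Q, K, V = q_feat @ Wq, kv_feat @ Wk, kv_feat @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (pixels, pixels)
    return attn @ V

# toy 6-band image flattened to (num_pixels, bands); assume bands 0-2 are
# visible (R, G, B) and bands 3-5 are invisible (e.g. NIR, SWIR1, SWIR2)
pixels = np.random.default_rng(1).standard_normal((64, 6))
visible, invisible = pixels[:, :3], pixels[:, 3:]

# each group attends to the other group's features, then the two fused
# feature sets are concatenated (the paper feeds these to a decoder)
vis_out = cross_attention(visible, invisible)
inv_out = cross_attention(invisible, visible)
fused = np.concatenate([vis_out, inv_out], axis=-1)
print(fused.shape)  # (64, 32)
```

In the actual network the per-group features would come from the independent encoders and the SEFM before this inter-group step; the sketch only shows why weakly correlated band groups are kept in separate streams until the explicit cross-attention exchange.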