DSTG: Distillation Swin Transformer for Cross-View Geolocalization

Published: 2025, Last Modified: 29 Jan 2026. Venue: IEEE Trans. Geosci. Remote. Sens. 2025. License: CC BY-SA 4.0
Abstract: In cross-view image geolocalization tasks, traditional convolutional neural network (CNN) models are limited in performance because they cannot effectively capture global correlations. Although Transformer-based methods can address this shortcoming, their computational complexity and GPU memory consumption are relatively high. To address these limitations, we propose Swin Transformer for cross-view geolocalization (STG), a novel model based on the Swin Transformer (ST) that leverages its computational complexity scaling linearly with image resolution. We design an adaptive window shift (AWS) mechanism and introduce a pretraining strategy for STG, which enhance STG's global modeling capability and its ability to learn general features. To further address the redundancy in our STG model, we propose the distillation STG (DSTG) model, which is obtained through knowledge distillation (KD) using the STG model as the teacher model. Because the student model struggles to learn directly from the teacher model's output during distillation, we propose a multiscale logit standardization distillation method. The proposed STG and DSTG models can obtain superior global embedding descriptions without relying on polar coordinate transformations. Experimental results on the CVUSA, CVACT_val, and CVACT_test datasets indicate that the DSTG model has significantly lower computational cost and GPU memory usage than the STG model, while both models achieve state-of-the-art performance. The source code is available at https://github.com/liangjxiong/DSTG
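The idea behind logit standardization in knowledge distillation can be illustrated with a minimal sketch. This is not the DSTG implementation: the abstract does not describe how the multiple scales are formed or combined, so the code below covers only a generic single-scale variant, in which teacher and student logits are each z-scored before a temperature softmax and a KL divergence. The function names, the temperature value, and the example logit vectors are all hypothetical.

```python
import math


def standardize(logits, eps=1e-7):
    """Z-score a logit vector so it has zero mean and (roughly) unit std."""
    mean = sum(logits) / len(logits)
    var = sum((x - mean) ** 2 for x in logits) / len(logits)
    std = math.sqrt(var)
    return [(x - mean) / (std + eps) for x in logits]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss_standardized(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) computed on z-scored logits.

    Assumption: a plain KL term on standardized logits; the paper's
    multiscale combination is not reproduced here.
    """
    p = softmax(standardize(teacher_logits), temperature)
    q = softmax(standardize(student_logits), temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical logits: the student is a scaled and shifted copy of the teacher.
teacher = [1.0, 2.0, 4.0]
student_affine = [10.0 * x + 3.0 for x in teacher]
student_reversed = [4.0, 2.0, 1.0]

loss_affine = kd_loss_standardized(student_affine, teacher)
loss_reversed = kd_loss_standardized(student_reversed, teacher)
```

In this sketch the student is not penalized for differing from the teacher only in logit scale or offset, which is one plausible reading of why standardization makes the teacher's output easier to learn from; a student that ranks the candidates differently still incurs a positive loss.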