SuperCLIP: Semantic Attribute-Guided Transformer With Super-Resolution and CLIP for Zero-Shot Remote Sensing Scene Classification

Published: 01 Jan 2025, Last Modified: 01 May 2025 · IEEE Geosci. Remote Sens. Lett. 2025 · CC BY-SA 4.0
Abstract: Zero-shot scene classification in remote sensing images is challenging, primarily because of wide variation in scene content and inconsistent spatial resolutions, which complicate the classification of unseen scene categories. We propose SuperCLIP, a framework that integrates a super-resolution module, contrastive language-image pretraining (CLIP), a semantic attribute-guided transformer (SAT), and a visual-semantic projection network (VSPN) to address these challenges. SuperCLIP leverages semantic attributes defined for three widely used remote sensing scene classification datasets and extracts semantic knowledge from them through CLIP. The super-resolution module recovers high-quality visual scenes from the remote sensing images. The SAT improves the transferability of visual features between seen and unseen categories by localizing object attributes, thereby learning more distinctive visual representations. These features are then mapped into a semantic embedding space by the VSPN, enabling stronger visual-semantic interaction for more accurate classification. Extensive experiments show that SuperCLIP significantly improves classification performance on unseen scene categories across the three benchmark remote sensing datasets. The code is available at https://github.com/ZSL-RSI-SC/SuperCLIP.
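To make the described pipeline concrete, below is a minimal sketch in PyTorch of how the abstract's components could be wired together: a super-resolution step, CLIP visual tokens, an attribute-guided transformer (SAT), and a projection into the semantic space (VSPN) scored against class prototypes. All module names, dimensions, and wiring here are assumptions for illustration, not the authors' released implementation (see the GitHub link above for that).

```python
# Hypothetical sketch of the SuperCLIP-style pipeline described in the abstract.
# Module names, dimensions, and wiring are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperResolution(nn.Module):
    """Placeholder super-resolution module: 2x bicubic upsampling plus a light refinement conv."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
        return self.refine(x)


class SemanticAttributeTransformer(nn.Module):
    """Placeholder SAT: attribute embeddings query the visual tokens to localize attributes."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, attribute_queries):
        out, _ = self.attn(attribute_queries, visual_tokens, visual_tokens)
        return out.mean(dim=1)  # pooled attribute-aware visual feature (B, dim)


class VisualSemanticProjection(nn.Module):
    """Placeholder VSPN: maps visual features into the semantic embedding space."""
    def __init__(self, dim=512, sem_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, sem_dim))

    def forward(self, x):
        return self.proj(x)


def classify_unseen(image, clip_visual_tokens, attribute_embeds, class_prototypes):
    """
    image:              (B, 3, H, W) low-resolution remote sensing scenes
    clip_visual_tokens: callable returning (B, N, 512) patch tokens from a CLIP image encoder
    attribute_embeds:   (A, 512) CLIP text embeddings of semantic attribute phrases
    class_prototypes:   (C, 512) semantic embeddings of the candidate (unseen) classes
    """
    sr, sat, vspn = SuperResolution(), SemanticAttributeTransformer(), VisualSemanticProjection()
    hr = sr(image)                                         # super-resolved scene
    tokens = clip_visual_tokens(hr)                        # visual tokens from CLIP
    queries = attribute_embeds.unsqueeze(0).expand(tokens.size(0), -1, -1)
    feat = sat(tokens, queries)                            # attribute-guided visual feature
    sem = F.normalize(vspn(feat), dim=-1)                  # project into semantic space
    return sem @ F.normalize(class_prototypes, dim=-1).t() # cosine scores per class
```

At inference, the class with the highest cosine score to the projected visual feature would be predicted; in a real system the modules would be trained on seen categories and the CLIP encoders would supply the visual tokens and attribute/class text embeddings.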