Abstract: Remote sensing image scene classification continues to face significant challenges due to high intraclass diversity and interclass similarity. Existing methods mainly exploit semantic associations between images to establish deep associations between classes, ignoring the rich high-level semantic knowledge contained in label text. This high-level information is especially valuable for distinguishing between confusing categories, as it enables the model to capture both similar and distinctive features effectively. In this article, we introduce a novel approach that incorporates label semantic information and propose a plug-and-play framework to guide the classification model in learning intraclass and interclass relationships. Specifically, our framework includes a dynamic soft label module (DSLM), which uses textual semantics to help the classification model learn interclass relationships via soft labels at the target level. In addition, we design a coarse-to-fine contrastive module (CFCM) that integrates textual semantics into contrastive learning, guiding the model to capture intraclass and interclass relationships at the feature level. Our framework is compatible with both convolutional neural network (CNN)-based and vision transformer (ViT)-based classification architectures and is employed solely during training, minimizing computational overhead. Experimental results on four datasets validate the effectiveness of our approach.
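The target-level idea behind the dynamic soft label module can be illustrated with a minimal sketch: one-hot targets are blended with a distribution derived from the pairwise similarity of label-text embeddings, so that semantically close classes share some probability mass. The function name, the mixing weight `alpha`, and the use of cosine similarity with a row-wise softmax are all illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of text-guided soft labels (not the paper's exact DSLM).
import numpy as np

def text_guided_soft_labels(hard_labels, class_text_emb, alpha=0.1):
    """Blend one-hot targets with a softmax over label-text similarity.

    hard_labels:    (B,) integer class indices
    class_text_emb: (C, D) text embeddings of the class names
    alpha:          mixing weight for the text-derived distribution (assumed)
    """
    n_classes = class_text_emb.shape[0]
    # Cosine similarity between class-name text embeddings.
    norm = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    sim = norm @ norm.T                                  # (C, C)
    # Row-wise softmax: similar classes receive more probability mass.
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    one_hot = np.eye(n_classes)[hard_labels]             # (B, C)
    return (1 - alpha) * one_hot + alpha * p[hard_labels]

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))          # 4 classes, 8-dim toy text embeddings
soft = text_guided_soft_labels(np.array([0, 2]), emb, alpha=0.2)
```

With a small `alpha`, each row still peaks at the ground-truth class but distributes the remaining mass according to textual similarity, which is one way a classifier could be nudged toward interclass relationships without any inference-time cost.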