MVITP: Multi-View Image-Text Perception for Few-Shot Remote Sensing Image Classification

Published: 2024 · Last Modified: 14 Nov 2024 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Few-shot learning is widely applied in remote sensing image classification, enabling rapid identification of new classes by effectively leveraging prior knowledge. However, current methods rely mainly on the image modality to address low intra-class similarity and high inter-class similarity, while multimodal approaches remain largely unexplored in remote sensing tasks. We therefore propose a novel framework for few-shot remote sensing image classification, named multi-view image-text perception (MVITP). Specifically, it maximizes mutual information across multiple views to train an image encoder and generate image features, while a text encoder generates text features. A multimodal fusion encoder then captures the similarity between the image and text features. Finally, class predictions are made by computing the similarity between the support set and the query set. Experiments on three remote sensing datasets demonstrate the outstanding performance of MVITP.
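The abstract outlines a pipeline of an image encoder trained over multiple views, a text encoder, a multimodal fusion encoder, and support-query similarity for prediction. Below is a minimal sketch of how such a pipeline might be wired together, assuming a PyTorch setting; the module names (`MultiViewImageEncoder`, `MultimodalFusionEncoder`, `few_shot_predict`), feature dimensions, and the cross-attention fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative MVITP-style sketch (assumptions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewImageEncoder(nn.Module):
    """Encodes several augmented views of an image and averages them.
    The paper trains this encoder by maximizing mutual information across
    views; only the forward pass is sketched here."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, n_views, C, H, W)
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))          # (b*v, feat_dim)
        feats = self.proj(feats).view(b, v, -1).mean(dim=1)  # average over views
        return F.normalize(feats, dim=-1)


class MultimodalFusionEncoder(nn.Module):
    """Fuses image and text features; a single cross-attention layer stands
    in for the fusion encoder described in the abstract (an assumption)."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (batch, dim), txt_feat: (n_classes, dim)
        q = img_feat.unsqueeze(1)                            # (batch, 1, dim)
        kv = txt_feat.unsqueeze(0).expand(img_feat.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)                      # image attends to text
        return F.normalize(self.norm(fused.squeeze(1) + img_feat), dim=-1)


def few_shot_predict(support_feat, support_labels, query_feat, n_way):
    """Cosine-similarity prediction: compare each query feature with class
    prototypes built from the support set."""
    prototypes = torch.stack(
        [support_feat[support_labels == c].mean(0) for c in range(n_way)]
    )
    prototypes = F.normalize(prototypes, dim=-1)
    logits = query_feat @ prototypes.t()                     # cosine similarities
    return logits.argmax(dim=-1)


if __name__ == "__main__":
    # Toy usage with a dummy backbone on random data (5-way, 1-shot, 2 views).
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
    img_enc = MultiViewImageEncoder(backbone)
    fusion = MultimodalFusionEncoder()
    txt_feat = F.normalize(torch.randn(5, 512), dim=-1)      # stand-in text features
    support_f = fusion(img_enc(torch.randn(5, 2, 3, 32, 32)), txt_feat)
    query_f = fusion(img_enc(torch.randn(10, 2, 3, 32, 32)), txt_feat)
    preds = few_shot_predict(support_f, torch.arange(5), query_f, n_way=5)
    print(preds)
```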