Language-Guided Visual Prompt Compensation for Multi-Modal Remote Sensing Image Classification with Modality Absence
Abstract: Joint classification of multi-modal remote sensing images has achieved great success thanks to the complementary advantages of multi-modal images. However, modality absence, commonly caused by imaging conditions in the real world, leads to the breakdown of most classification methods that rely on complete modalities. Existing approaches either learn shared representations or train a specific model for each absence case, and thus struggle to balance the complementary advantages of the modalities with scalability to diverse absence cases. In this paper, we propose a language-guided visual prompt compensation network (LVPCnet) to achieve joint classification under arbitrary modality absence with a unified model that simultaneously exploits modality complementarity. It embeds missing-modality-specific knowledge into visual prompts to guide the model in recovering complete modal information from the available modalities for classification. Specifically, a language-guided visual feature decoupling stage (LVFD-stage) is designed to extract shared and specific modal features from multi-modal images, establishing a complementary representation model of complete modalities. Subsequently, an absence-aware visual prompt compensation stage (VPC-stage) is proposed to learn visual prompts containing missing-modality-specific knowledge through cross-modal representation alignment, further guiding the complementary representation model to reconstruct modality-specific features of missing modalities from available ones based on the learned prompts. The VPC-stage trains only the visual prompts to perceive missing information, without retraining the model, which enables effective scaling to arbitrary modality-absence scenarios. Systematic experiments conducted on three public datasets validate the effectiveness of the proposed approach.
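The following is a minimal, illustrative sketch of the two-stage idea summarized in the abstract: a decoupling encoder that splits a modality into shared and specific features (LVFD-style), and a learnable visual prompt that compensates for a missing modality's specific feature from the available one (VPC-style). All module names, dimensions, and the concrete compensation mechanism are assumptions for illustration only, not the authors' released implementation.

```python
# Illustrative sketch only; architecture details are assumed, not from the paper.
import torch
import torch.nn as nn


class DecoupledEncoder(nn.Module):
    """Stage 1 (LVFD-style): split one modality into shared and specific features."""

    def __init__(self, in_dim: int, feat_dim: int):
        super().__init__()
        self.shared_head = nn.Linear(in_dim, feat_dim)    # modality-shared branch
        self.specific_head = nn.Linear(in_dim, feat_dim)  # modality-specific branch

    def forward(self, x: torch.Tensor):
        return self.shared_head(x), self.specific_head(x)


class PromptCompensator(nn.Module):
    """Stage 2 (VPC-style): a learnable visual prompt guides reconstruction of the
    missing modality's specific feature from the available modality's feature."""

    def __init__(self, feat_dim: int, prompt_len: int = 4):
        super().__init__()
        # Learned prompt tokens intended to carry missing-modality knowledge.
        self.prompt = nn.Parameter(torch.randn(prompt_len, feat_dim))
        self.reconstruct = nn.Linear(feat_dim * (1 + prompt_len), feat_dim)

    def forward(self, available_feat: torch.Tensor) -> torch.Tensor:
        batch = available_feat.shape[0]
        prompt = self.prompt.flatten().expand(batch, -1)
        return self.reconstruct(torch.cat([available_feat, prompt], dim=-1))


# Toy usage: HSI available, LiDAR missing (all dimensions are placeholders).
hsi_encoder = DecoupledEncoder(in_dim=144, feat_dim=64)
lidar_compensator = PromptCompensator(feat_dim=64)

hsi = torch.randn(8, 144)                            # available modality batch
hsi_shared, hsi_specific = hsi_encoder(hsi)          # stage-1 decoupling
lidar_specific_hat = lidar_compensator(hsi_shared)   # stage-2 compensation

fused = torch.cat([hsi_shared, hsi_specific, lidar_specific_hat], dim=-1)
logits = nn.Linear(fused.shape[-1], 10)(fused)       # downstream classifier head
```

In this sketch, only the prompt and reconstruction head would need training for a new absence case, which mirrors the scalability claim that the VPC-stage trains visual prompts without retraining the full model.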
Primary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Joint classification of multimodal remote sensing images is an effective technique for fully integrating the complementary information of multiple modalities for land cover classification, and it has become a hot research topic in multimodal remote sensing image processing. However, in practical applications, some modal data may be missing due to sensor failures or inconsistent satellite revisit cycles, which poses challenges to multimodal joint classification. To this end, this work proposes a language-guided visual prompt compensation network (LVPCnet) to achieve joint classification under arbitrary modality absence. The proposed LVPCnet leverages language priors to embed missing-modality-specific knowledge into visual prompts, guiding the model to capture complete modal information for classification from the available modalities. This enables efficient handling of missing data while maintaining superior classification performance, improving the practical applicability of multimodal systems across application scenarios.
Supplementary Material: zip
Submission Number: 4702