Abstract: Recent advancements in cross-modal transformers have demonstrated superior performance on RGB-D segmentation tasks by effectively integrating information from both the RGB and depth modalities. However, existing methods often overlook the varying levels of informative content in each modality, treating them equally and processing them with models of the same architecture. This oversight can hinder segmentation performance, especially since RGB images typically carry significantly more information than depth images. To address this issue, we propose PrimKD, a knowledge distillation-based approach to guided multimodal fusion that emphasizes the primary RGB modality. In our approach, a model trained exclusively on the RGB modality serves as the teacher, guiding the learning of a student model that fuses both RGB and depth modalities.
To prioritize information from the primary RGB modality while still leveraging the depth modality, we incorporate primary focused feature reconstruction and a selective alignment scheme. This integration enhances the overall feature fusion and improves segmentation performance.
We evaluate our proposed method on the NYU Depth V2 and SUN-RGBD datasets, and the experimental results demonstrate the effectiveness of PrimKD. Specifically, our approach achieves mIoU scores of 57.8 and 52.5 on these two datasets, respectively, surpassing existing counterparts by 1.5 and 0.4 mIoU.
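The following is a minimal PyTorch sketch of the distillation setup described above: a frozen RGB-only teacher guides a student that fuses RGB and depth features, with a task loss plus a feature-alignment term. The module names (RGBTeacher, RGBDStudent), the toy encoders, and the simple L2 alignment are illustrative assumptions, not the paper's exact components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGBTeacher(nn.Module):
    """Stand-in RGB-only encoder, assumed pretrained and kept frozen."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, 3, padding=1)

    def forward(self, rgb):
        return self.encoder(rgb)  # teacher features from the primary modality

class RGBDStudent(nn.Module):
    """Stand-in student that fuses RGB and depth features before prediction."""
    def __init__(self, dim=64, num_classes=40):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.depth_enc = nn.Conv2d(1, dim, 3, padding=1)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, rgb, depth):
        fused = self.fuse(torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1))
        return fused, self.head(fused)

teacher, student = RGBTeacher().eval(), RGBDStudent()
rgb, depth = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)
labels = torch.randint(0, 40, (2, 64, 64))

with torch.no_grad():
    t_feat = teacher(rgb)                  # primary-modality guidance
s_feat, logits = student(rgb, depth)
# Segmentation loss plus a simple feature-distillation term aligning
# the student's fused features with the RGB teacher's features.
loss = F.cross_entropy(logits, labels) + F.mse_loss(s_feat, t_feat)
loss.backward()
```

In PrimKD this alignment is selective and reconstruction-based rather than a plain L2 penalty, but the sketch conveys the overall teacher-student flow.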
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: We present a KD-based approach to guide multimodal fusion, with a specific focus on the primary modality. Unlike existing methods that often treat modalities equally without considering their varying levels of content, our findings and proposed method offer insights into effective multimodal processing.
Submission Number: 3102