Keywords: Multi-modal Learning, Semantic Segmentation, Knowledge Distillation
Abstract: Simultaneously using multimodal inputs from multiple sensors to train segmentors is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where multimodal segmentors over-rely on certain modalities, causing performance to drop when those modalities are missing, a common situation in real-world applications. To this end, we develop the \textbf{first} framework for learning a robust segmentor that can handle any combination of visual modalities. Specifically, we first introduce a parallel multimodal learning strategy for training a strong teacher. Cross-modal and unimodal distillation is then performed in the multiscale representation space, transferring feature-level knowledge from the multimodal teacher to anymodal segmentors in order to address unimodal bias and avoid over-reliance on specific modalities. Moreover, a prediction-level, modality-agnostic semantic distillation is proposed to transfer semantic knowledge for segmentation. Extensive experiments on both synthetic and real-world multi-sensor benchmarks demonstrate that our method achieves superior performance (+6.37% and +6.15%).
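The abstract mentions two distillation terms: feature-level transfer in the multiscale representation space and prediction-level semantic distillation on the segmentation outputs. Below is a minimal, hedged sketch of what such losses commonly look like; it is not the authors' implementation, and the function names (`feat_distill_loss`, `semantic_distill_loss`) and the temperature `T` are illustrative assumptions.

```python
# Sketch of multiscale feature distillation and prediction-level semantic
# distillation (illustrative only; not the paper's actual code).
import torch
import torch.nn.functional as F

def feat_distill_loss(student_feats, teacher_feats):
    """MSE between student features and detached multimodal-teacher features,
    averaged over the feature scales."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        loss = loss + F.mse_loss(fs, ft.detach())
    return loss / len(student_feats)

def semantic_distill_loss(student_logits, teacher_logits, T=2.0):
    """Per-pixel KL divergence between temperature-softened class distributions."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits.detach() / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

# Dummy usage: 3 feature scales, 19 classes, batch of 4, 64x64 crops.
if __name__ == "__main__":
    s_feats = [torch.randn(4, 256, 64 // 2**i, 64 // 2**i) for i in range(3)]
    t_feats = [torch.randn_like(f) for f in s_feats]
    s_logits = torch.randn(4, 19, 64, 64)
    t_logits = torch.randn(4, 19, 64, 64)
    total = feat_distill_loss(s_feats, t_feats) + semantic_distill_loss(s_logits, t_logits)
    print(total.item())
```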
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13116