Abstract: Semantic segmentation methods achieve robust and reliable scene understanding under adverse illumination conditions by integrating complementary information from visible and thermal infrared (RGB-T) images. Existing methods primarily focus on designing various feature fusion modules between modalities, overlooking that feature learning is the critical aspect of scene understanding. In this paper, we propose a novel module-free Multiplex Interactive Learning Network (MiLNet) for RGB-T semantic segmentation, which adeptly integrates multi-model, multi-modal, and multi-level feature learning to fully exploit the potential of multiplex feature interaction. Specifically, robust knowledge is transferred from a vision foundation model to our task-specific model to enhance its segmentation performance. Within the task-specific model, an asymmetric simulated learning strategy is introduced to facilitate mutual learning of geometric and semantic information between high- and low-level features across modalities. Additionally, an inverse hierarchical fusion strategy based on feature learning pairs is adopted and further refined with multi-label and multi-scale supervision. Experimental results on the MFNet and PST900 datasets demonstrate that MiLNet outperforms state-of-the-art methods in terms of mIoU. A remaining limitation is that performance under few-sample conditions leaves room for further improvement. The code and results of our method are available at https://github.com/Jinfu-pku/MiLNet.
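The abstract mentions transferring knowledge from a vision foundation model to a task-specific two-stream (RGB and thermal) segmentation model. The sketch below is not the authors' implementation; it only illustrates the general idea of such transfer via a feature-distillation loss, and every module, dimension, and loss weight in it is a hypothetical placeholder.

```python
# Minimal sketch (not the MiLNet implementation): transferring knowledge from a
# frozen "foundation model" teacher to a toy two-stream RGB-T segmentation
# student via a feature-distillation loss. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamSegNet(nn.Module):
    """Toy task-specific model with separate RGB and thermal encoders."""

    def __init__(self, num_classes: int = 9, width: int = 32):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU())
        self.thermal_enc = nn.Sequential(nn.Conv2d(1, width, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(2 * width, num_classes, 1)

    def forward(self, rgb, thermal):
        fused = torch.cat([self.rgb_enc(rgb), self.thermal_enc(thermal)], dim=1)
        return self.head(fused), fused  # per-pixel logits and fused features


def distillation_step(student, teacher, proj, rgb, thermal, labels):
    """One training step: segmentation loss plus feature distillation from the teacher."""
    logits, feats = student(rgb, thermal)
    with torch.no_grad():                       # teacher (foundation model) stays frozen
        t_feats = teacher(rgb)                  # assumed to return a dense feature map
    t_feats = F.interpolate(t_feats, size=feats.shape[-2:], mode="bilinear",
                            align_corners=False)
    seg_loss = F.cross_entropy(logits, labels)
    kd_loss = F.mse_loss(proj(feats), t_feats)  # align student features to the teacher's
    return seg_loss + 0.5 * kd_loss             # 0.5 is an arbitrary illustrative weight


if __name__ == "__main__":
    student = TwoStreamSegNet()
    teacher = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # stand-in teacher
    proj = nn.Conv2d(64, 16, 1)   # projects student features to the teacher's channel dim
    rgb = torch.randn(2, 3, 64, 64)
    thermal = torch.randn(2, 1, 64, 64)
    labels = torch.randint(0, 9, (2, 64, 64))
    print(distillation_step(student, teacher, proj, rgb, thermal, labels).item())
```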