Highlights

• We propose the Cross-Modal Transformer (CMFormer), a Transformer-based cross-modal semantic segmentation model that achieves richer cross-modal information interaction by capturing long-range contextual dependencies. CMFormer comprises the MS-CAC module and the GFA module, and achieves state-of-the-art results on the RGB-D semantic segmentation task.

• We design the MS-CAC and GFA modules by exploiting the cross-modal properties of the Transformer. The MS-CAC module performs channel feature correction while preserving multi-scale information, and the GFA module performs spatial feature fusion that fully accounts for both global and local features; together they enable CMFormer's cross-modal information interaction through long-range contextual dependencies. A hedged illustrative sketch of this two-stage fusion pattern follows this list.

• The effectiveness of CMFormer is evaluated with extensive experiments on the SOP and NYU Depth v2 datasets, on both of which it achieves state-of-the-art results. On the SOP dataset, CMFormer reaches 96.74% MPA and 92.98% mIoU while running at 43 FPS (frames per second), offering good real-time performance alongside high accuracy.
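To make the two-stage fusion pattern described above concrete, the following is a minimal PyTorch sketch of the general idea, not the authors' implementation: a channel-calibration stage in which each modality's channels are re-weighted by pooled statistics from the other modality, followed by a spatial-fusion stage that applies self-attention over all positions to capture long-range context. The class names `ChannelCrossCalibration` and `GlobalSpatialFusion`, and all hyperparameters, are hypothetical stand-ins; the exact MS-CAC and GFA designs (including their multi-scale branches) are defined in the paper body.

```python
import torch
import torch.nn as nn

class ChannelCrossCalibration(nn.Module):
    """Hypothetical channel-correction stage (not the paper's exact MS-CAC):
    SE-style gates where each modality's channels are re-weighted using
    globally pooled statistics of the *other* modality."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate_rgb = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.gate_depth = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        b, c, _, _ = rgb.shape
        # Global average pooling yields per-channel statistics per modality.
        s_rgb = rgb.mean(dim=(2, 3))
        s_depth = depth.mean(dim=(2, 3))
        # Cross gating: depth statistics correct RGB channels and vice versa.
        rgb = rgb * self.gate_rgb(s_depth).view(b, c, 1, 1)
        depth = depth * self.gate_depth(s_rgb).view(b, c, 1, 1)
        return rgb, depth

class GlobalSpatialFusion(nn.Module):
    """Hypothetical spatial-fusion stage (not the paper's exact GFA):
    the concatenated modalities are projected to one feature map, then
    multi-head self-attention over all spatial positions supplies the
    long-range (global) context on top of the local convolutional features."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        fused = self.proj(torch.cat([rgb, depth], dim=1))  # B, C, H, W
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)          # B, H*W, C
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)              # residual + norm
        return tokens.transpose(1, 2).view(b, c, h, w)

if __name__ == "__main__":
    rgb = torch.randn(1, 64, 32, 32)
    depth = torch.randn(1, 64, 32, 32)
    rgb, depth = ChannelCrossCalibration(64)(rgb, depth)
    out = GlobalSpatialFusion(64)(rgb, depth)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

The sketch mirrors the division of labor the highlights describe: channel-level correction happens before fusion, and the attention step is what supplies the long-range contextual dependencies that a purely convolutional fusion would miss.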