Abstract: Real-time semantic segmentation aims to balance a network's segmentation accuracy against its inference speed. Recently, dual-resolution networks have proven efficient and effective in this field; however, because self-attention has O(N²) computational complexity, pure CNN-based networks still dominate this architecture. In this paper, we propose RTLinearFormer, an efficient dual-resolution network with Transformers designed specifically for real-time semantic segmentation, which achieves a favorable trade-off between segmentation accuracy and inference speed. In the low-resolution branch, RTLinearFormer extracts global contextual information with a lightweight channel-wise multi-scale attention of linear complexity, combining channel feature focusing and multi-scale features; in the high-resolution branch, it applies a cross-resolution ReLU-based linear attention, also of linear complexity, to fuse the global context derived from the low-resolution branch with the detailed information of the high-resolution branch. Extensive experiments demonstrate the effectiveness and efficiency of RTLinearFormer for real-time semantic segmentation. The model achieves 78.41% val mIoU at 66.7 FPS on Cityscapes using a single RTX 3090 GPU; when pretrained on Cityscapes, it achieves 77.4% test mIoU at 141.3 FPS on CamVid, outperforming other state-of-the-art models. To further validate its generalization ability, we also evaluate it on ADE20K and COCOStuff, achieving 35.70% val mIoU at 107.3 FPS and 31.15% test mIoU at 150.2 FPS, respectively, again reaching state-of-the-art performance.
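The complexity claim above can be illustrated with a minimal NumPy sketch of generic ReLU-based linear attention (the function name, shapes, and normalization are illustrative assumptions, not the paper's exact formulation): replacing the softmax with ReLU feature maps makes the attention weights a plain product of non-negative factors, so K^T V can be computed first and the cost drops from O(N²) to O(N) in the number of tokens N.

```python
import numpy as np

def relu_linear_attention(Q, K, V, eps=1e-6):
    """Illustrative ReLU-based linear attention (assumed form, not the
    paper's exact module). Q, K: (N, d); V: (N, d_v); returns (N, d_v)."""
    Q = np.maximum(Q, 0.0)   # ReLU feature map on queries
    K = np.maximum(K, 0.0)   # ReLU feature map on keys
    # Associativity trick: compute K^T V first, an O(N * d * d_v) operation,
    # instead of the O(N^2) pairwise attention matrix.
    kv = K.T @ V                                        # (d, d_v)
    z = Q @ K.sum(axis=0, keepdims=True).T + eps        # (N, 1) normalizer
    return (Q @ kv) / z                                 # (N, d_v)
```

Because the ReLU weights are non-negative, this is algebraically identical to first forming the full N×N matrix relu(Q) relu(K)^T, row-normalizing it, and multiplying by V, but without ever materializing that quadratic-cost matrix.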