Abstract: Semantic segmentation assigns a class label to every pixel of an image, helping computers better understand visual scenes. The introduction of the Vision Transformer has shifted many downstream computer vision tasks, especially semantic segmentation, from traditional CNN architectures to Transformer architectures. However, the patch-based strategy of the Vision Transformer still faces two limitations: contextual information becomes incoherent across patch boundaries, and many patches are redundant. To address these challenges, we propose a Pixel-Level Fusion block, which enhances the contextual relationships between patches and merges redundant patches with a similarity algorithm, reducing the overall patch count. On the COCO-Stuff10k [33] dataset, our method shows significant improvements over the state of the art, achieving a 19.4% increase in mIoU together with a 21.6% inference speed improvement on GPU.
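To give a rough sense of the similarity-based merging idea mentioned above, the sketch below greedily groups patch tokens by cosine similarity and replaces each group with its mean. The threshold value, greedy grouping, and mean pooling are illustrative assumptions, not the paper's actual Pixel-Level Fusion block.

```python
import torch


def merge_redundant_patches(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Merge patch tokens whose cosine similarity exceeds `threshold`.

    tokens: (N, D) patch embeddings for one image.
    Returns a reduced token set where similar patches are averaged.
    Illustrative sketch only; the paper's exact algorithm may differ.
    """
    # Normalize embeddings so dot products equal cosine similarities.
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    sim = normed @ normed.T  # (N, N) pairwise cosine similarity
    n = tokens.shape[0]

    merged = []
    used = torch.zeros(n, dtype=torch.bool)
    for i in range(n):
        if used[i]:
            continue
        # Group token i with all not-yet-merged tokens similar to it.
        group = (sim[i] >= threshold) & ~used
        group[i] = True
        used |= group
        # Represent the whole group by the mean of its members.
        merged.append(tokens[group].mean(dim=0))
    return torch.stack(merged)


# Example: 196 patch tokens (14x14 grid) with 768-dim embeddings.
patches = torch.randn(196, 768)
reduced = merge_redundant_patches(patches, threshold=0.9)
print(reduced.shape)  # fewer than 196 tokens if any patches were similar
```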