Abstract: Semantic segmentation assigns a class label to every pixel of an image, helping computers better understand visual scenes. The introduction of the Vision Transformer has shifted many downstream computer vision tasks, especially semantic segmentation, from traditional CNN architectures to Transformer architectures. However, the patch-based strategy of the Vision Transformer still faces two limitations: contextual information becomes incoherent across patch boundaries, and many patches are redundant. To address these challenges, we propose a Pixel-Level Fusion block, which enhances the contextual relationships between patches and merges redundant patches with a similarity algorithm, reducing the overall patch count. On the COCO-Stuff10k [33] dataset, our method shows significant improvements over the state of the art, achieving a 19.4% increase in mIoU together with a 21.6% inference speed improvement on GPU.
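To give a rough sense of the similarity-based merging idea mentioned above, the sketch below greedily groups patch tokens by cosine similarity and replaces each group with its mean. The threshold value, greedy grouping, and mean pooling are illustrative assumptions, not the paper's actual Pixel-Level Fusion block.

```python
import torch


def merge_redundant_patches(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Merge patch tokens whose cosine similarity exceeds `threshold`.

    tokens: (N, D) patch embeddings for one image.
    Returns a reduced token set where similar patches are averaged.
    Illustrative sketch only; the paper's exact algorithm may differ.
    """
    # Normalize embeddings so dot products equal cosine similarities.
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    sim = normed @ normed.T  # (N, N) pairwise cosine similarity
    n = tokens.shape[0]

    merged = []
    used = torch.zeros(n, dtype=torch.bool)
    for i in range(n):
        if used[i]:
            continue
        # Group token i with all not-yet-merged tokens similar to it.
        group = (sim[i] >= threshold) & ~used
        group[i] = True
        used |= group
        # Represent the whole group by the mean of its members.
        merged.append(tokens[group].mean(dim=0))
    return torch.stack(merged)


# Example: 196 patch tokens (14x14 grid) with 768-dim embeddings.
patches = torch.randn(196, 768)
reduced = merge_redundant_patches(patches, threshold=0.9)
print(reduced.shape)  # fewer than 196 tokens if any patches were similar
```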