ESTJ: Efficient Semantic Segmentation via Token Joint Merging

Published: 2025, Last Modified: 25 Jan 2026, ICME 2025, CC BY-SA 4.0
Abstract: Vision Transformers (ViTs) leverage the attention mechanism for feature extraction but often suffer from high computational costs. To address this issue, prior works have introduced token reduction methods involving fixed-window local merging and global Bipartite Matching. However, these methods face significant challenges, such as insufficient merging due to fixed-size local windows and incorrect merging of informative tokens in global merging. To overcome these limitations, we propose Efficient Semantic Segmentation via Token Joint Merging (ESTJ) for ViT-based semantic segmentation networks. Specifically, ESTJ merges tokens using two strategies: Hierarchical Condition Pooling (HCP), which employs hierarchical local windows to effectively select sufficient tokens, and Protected Bipartite Matching (PBM), designed to preserve informative tokens using average similarity between a token and all other tokens. Experimental results demonstrate that ESTJ improves throughput by 75%, reduces GFLOPs by 40%, and enhances mIoU by up to 1.1%. Moreover, ESTJ can adjust the merging threshold during inference to adapt to scenarios that prioritize efficiency or accuracy. Compared to existing methods, ESTJ achieves a better balance between computational efficiency and segmentation accuracy.
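The idea behind Protected Bipartite Matching can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that informative tokens are preserved using the average similarity between a token and all other tokens. The sketch assumes that tokens with the lowest average cosine similarity are the informative ones, that the remaining tokens are split alternately into two sets, and that matched tokens are merged by averaging when their similarity reaches a threshold. The function name `protected_bipartite_merge` and the parameters `threshold` and `num_protected` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def protected_bipartite_merge(tokens, threshold=0.9, num_protected=1):
    """Illustrative sketch only, not the ESTJ implementation.

    tokens: list of equal-length feature vectors (lists of floats).
    threshold: hypothetical minimum similarity required to merge a pair.
    num_protected: hypothetical number of tokens exempt from merging.
    """
    n = len(tokens)
    if n < 2:
        return [list(t) for t in tokens]

    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    # Average similarity between each token and all *other* tokens.
    avg_sim = [(sum(sim[i]) - sim[i][i]) / (n - 1) for i in range(n)]

    # ASSUMPTION: the least similar-on-average tokens are the informative
    # ones, so they are protected. The abstract does not state the rule.
    order = sorted(range(n), key=lambda i: avg_sim[i])
    protected = set(order[:num_protected])

    # ASSUMPTION: alternating split of the unprotected tokens into A and B.
    candidates = [i for i in range(n) if i not in protected]
    set_a, set_b = candidates[0::2], candidates[1::2]

    groups = {j: [j] for j in set_b}  # destination token -> merged members
    kept_a = set()
    for i in set_a:
        best = max(set_b, key=lambda j: sim[i][j]) if set_b else None
        if best is not None and sim[i][best] >= threshold:
            groups[best].append(i)  # merge token i into its best match
        else:
            kept_a.add(i)           # no sufficiently similar partner

    out = []
    for i in range(n):
        if i in protected or i in kept_a:
            out.append(list(tokens[i]))
        elif i in groups:
            members = groups[i]
            dim = len(tokens[i])
            # Merge by averaging the features of all group members.
            out.append([sum(tokens[m][d] for m in members) / len(members)
                        for d in range(dim)])
    return out


tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [1.0, 0.02]]
merged = protected_bipartite_merge(tokens, threshold=0.9, num_protected=1)
print(len(tokens), "->", len(merged))
```

In this toy input, three tokens are nearly identical and one is distinct. The distinct token is protected and the other three collapse into one. Raising `threshold` merges fewer tokens, which loosely mirrors the accuracy-versus-efficiency trade-off the abstract describes for inference-time threshold adjustment.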