Abstract: Segmentation refinement attains high-quality performance by identifying and correcting error-prone points. Existing refinement algorithms primarily focus on querying error-prone points located around object boundaries and conducting per-pixel refinement through networks with a global receptive field, such as transformers. However, these algorithms neglect errors associated with complex features, including irregular shapes and diverse color variations. Additionally, extracting global correspondences may produce redundant correlation features, thereby increasing computational overhead. In this paper, we introduce P2Glocal-Transfiner, a novel pseudo-progressive global-to-local transformer for semantic segmentation refinement. Our P2Glocal-Transfiner first employs a global-to-local error-prone selection strategy to comprehensively select error-prone points, revealing more errors than approaches that focus solely on object boundaries. The P2Glocal-Transfiner then combines a local transformer with a pseudo-progressive global transformer to capture both short- and long-range correlations, enabling refinement from coarse to fine. Moreover, our P2Glocal-Transfiner uses a convolution-based attention mechanism, achieving a better balance between performance and computational cost than vanilla transformer and CNN models. Extensive experiments on ADE20K and Cityscapes demonstrate that our method yields significant refinement across various baseline models, including both convolution-based and transformer-based architectures.
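To make the convolution-based attention idea concrete, the sketch below shows one common way such a block can be built: spatial attention weights are generated by depthwise convolutions and used to modulate the features, avoiding the quadratic cost of dense query-key dot products. This is an illustrative assumption, not the paper's implementation; the module name `ConvAttention`, the kernel sizes, and the dilation are all hypothetical.

```python
# Minimal sketch (not the paper's implementation): convolution-based attention
# in which depthwise convolutions generate a spatial attention map that
# modulates the projected features. Kernel sizes and dilation are assumptions.
import torch
import torch.nn as nn


class ConvAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, dim, kernel_size=1)
        # Depthwise convolutions approximate local and longer-range context
        # without computing a full attention matrix.
        self.local_dw = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.dilated_dw = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                    dilation=3, groups=dim)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.proj_in(x)
        # Convolutionally generated attention map modulates the values.
        attn = self.dilated_dw(self.local_dw(v))
        return self.proj_out(attn * v) + x


if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32)   # (batch, channels, H, W)
    out = ConvAttention(64)(feats)
    print(out.shape)                      # torch.Size([1, 64, 32, 32])
```

The residual connection and 1x1 projections mirror standard attention-block design; only the weight-generation step is replaced by convolutions, which is where the computational savings over vanilla self-attention come from.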
DOI: 10.1109/tetci.2025.3604819