Hierarchical Contrastive Learning for Semantic Segmentation

Published: 01 Jan 2025 · Last Modified: 13 Nov 2025 · IEEE Trans. Neural Networks Learn. Syst. 2025 · CC BY-SA 4.0
Abstract: Recently, pixel-to-pixel contrastive learning in a single-scale feature space has been widely studied in semantic segmentation to learn a unified feature representation for pixels of the same category. However, a strictly unified representation is overly restrictive, and the receptive field of each single-scale pixel is limited, which is insufficient to capture the representative features of a category. To address these problems, this article extends the single-scale feature space to a multiscale one and proposes a hierarchical contrastive learning (Hi-CL) method to exploit pixel-to-component semantic relationships. First, we generate multiscale candidate samples by applying several pooling windows of different sizes to a feature map, where different windows may represent different parts of the objects in the image. Then, we prune the sample set with threshold-based criteria to select appropriate samples for feature representation learning. Finally, Hi-CL is performed to learn pixel-to-component consistency with the pruned samples. Our method is easy to apply to existing semantic segmentation models and yields consistent improvements. Furthermore, we achieve state-of-the-art results on three popular benchmarks: the Cityscapes, ADE20K, and COCO Stuff datasets.
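The three stages the abstract describes (multiscale component generation via pooling, threshold-based pruning, and a pixel-to-component contrastive loss) can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's implementation: the window sizes, the cosine-similarity pruning criterion, and the InfoNCE-style loss form are all assumptions.

```python
import numpy as np

def multiscale_components(feat, window_sizes=(2, 4)):
    """Average-pool a (C, H, W) feature map with several window sizes.

    Each non-overlapping window yields one (C,) component vector; larger
    windows cover larger object parts. Sketch only -- the paper's exact
    sampling scheme may differ.
    """
    C, H, W = feat.shape
    comps = []
    for w in window_sizes:
        for i in range(0, H - w + 1, w):
            for j in range(0, W - w + 1, w):
                comps.append(feat[:, i:i + w, j:j + w].mean(axis=(1, 2)))
    return comps

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def prune(comps, anchor, threshold=0.5):
    """Threshold-based pruning (assumed criterion): keep components whose
    cosine similarity to an anchor embedding exceeds the threshold."""
    return [c for c in comps if cosine(c, anchor) > threshold]

def pixel_to_component_loss(pixel, pos_comps, neg_comps, tau=0.1):
    """InfoNCE-style contrast between one pixel embedding and component
    samples: pull the pixel toward same-category components (positives),
    push it from other-category components (negatives)."""
    pos = np.array([cosine(pixel, c) for c in pos_comps]) / tau
    neg = np.array([cosine(pixel, c) for c in neg_comps]) / tau
    # per-positive -log softmax over {positive} U {negatives}
    losses = [-s + np.log(np.exp(s) + np.exp(neg).sum()) for s in pos]
    return float(np.mean(losses))
```

For example, a 4×4 feature map with windows of sizes 2 and 4 produces four fine-grained components plus one global component, which are then pruned and contrasted against each pixel embedding.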