Multi-Modal Few-Shot Semantic Segmentation Based on Triple Attention Mechanism and Hierarchical Decoding Transformer

Junsong Leng, Zeyu Zhao, Chang Tian, Zhong Chen, Guoyou Wang, Xiaoxuan Liu

Published: 2026, Last Modified: 02 Mar 2026. IEEE Trans. Circuits Syst. Video Technol. 2026. CC BY-SA 4.0
Abstract: The goal of Few-Shot Segmentation (FSS) is to segment images of novel categories using only a few labeled examples. However, FSS tasks face challenges such as over-segmentation and poor generalization. This paper addresses these challenges by employing a triple attention mechanism (TAM) and a hierarchical decoding transformer (HDT). Specifically, TAM is proposed to enhance the model's ability to focus on spatial regions within the query features that are semantically relevant to the target category. The HDT module then aggregates the enhanced query features with the support features in a decoupled manner, generating dense features with pixel-level semantic relevance, which improves the segmentation ability on novel classes. Additionally, considering that the class-level labels of an image can provide weak supervision for the segmentation task, this paper introduces a model based on contrastive language-image pretraining (CLIP) to enhance the segmentation performance. The Grad-CAM mechanism is utilized to convert the class logit scores from CLIP into localization heatmaps, effectively leveraging the text label information to provide prior localization cues for the model. Extensive experiments conducted on the PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets demonstrate state-of-the-art performance. The experimental results validate the effectiveness of the proposed method, which significantly improves the generalization and segmentation performance of few-shot semantic segmentation models on novel categories.
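To illustrate how a class logit score can be turned into a localization heatmap, the following is a minimal sketch of the standard Grad-CAM computation in NumPy. It is not the authors' implementation: the abstract does not specify which CLIP layer, prompts, or post-processing are used. Here the class score is a hypothetical linear head over globally averaged features (`head_w` is an assumption), so the gradient of the logit with respect to the feature maps can be written analytically instead of coming from a real CLIP backward pass.

```python
import numpy as np

def grad_cam(features, grads):
    """Standard Grad-CAM. features, grads: (C, H, W). Returns an (H, W) map in [0, 1]."""
    # Channel importance: spatial average of the logit's gradient per channel.
    weights = grads.mean(axis=(1, 2))
    # Weighted sum of feature maps over the channel axis -> (H, W).
    cam = np.tensordot(weights, features, axes=1)
    # Keep only evidence that increases the class score.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy setup (all values are illustrative assumptions, not from the paper).
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
features = rng.random((C, H, W)) * 0.1
features[0, 2:5, 3:6] += 1.0  # channel 0 responds strongly on the "object" region

# Hypothetical class score: logit = head_w . global_average_pool(features),
# so d(logit)/d(features[k, i, j]) = head_w[k] / (H * W).
head_w = np.array([2.0, -1.0, 0.5, 0.0])
grads = np.broadcast_to(head_w[:, None, None] / (H * W), (C, H, W))

heatmap = grad_cam(features, grads)
row, col = np.unravel_index(heatmap.argmax(), heatmap.shape)
```

In a real pipeline the gradients would come from backpropagating the CLIP image-text similarity for the class prompt to a chosen feature layer; the resulting map would then serve as the prior localization cue described above.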