TENet: A Text-Enhanced Network for Few-Shot Semantic Segmentation with Background-Aware Query Refinement

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: few-shot semantic segmentation, few-shot learning, multimodal model, CLIP, DeepSeek, background text aware
Abstract: Existing few-shot semantic segmentation (FSS) methods suffer from limited annotated data and domain gaps between support and query images. Although recent multi-modal approaches incorporate textual information to mitigate this gap, they focus primarily on visual features and foreground text, overlooking the value of background semantics. However, background context plays a crucial role in reasoning: its semantic association with the foreground helps the model better distinguish the target. Motivated by this, we propose a Text-Enhanced Network (TENet), a novel FSS framework that uses both foreground and background text to generate high-quality activation maps for query features. TENet adaptively generates background text from foreground semantics via a DeepSeek-based activation generation module. The background text is encoded with a CLIP text encoder and fused with visual features to produce activation maps. To further improve alignment precision, we propose a joint optimization strategy that combines dynamic and fixed refinement. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that TENet consistently outperforms state-of-the-art methods, validating the effectiveness of incorporating background text information and refined activation mechanisms in FSS.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11446
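For illustration only, the sketch below shows one common way a text-driven activation map of the kind described in the abstract can be computed: cosine similarity between CLIP-style text embeddings for a foreground class and a generated background phrase and the per-pixel query visual features, with the background map used to suppress non-target regions. This is not the authors' implementation; all tensor shapes, names, and the softmax-based fusion are assumptions for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: D = shared text/vision embedding dim, H x W = query feature map size.
D, H, W = 512, 32, 32

# Stand-ins for CLIP text embeddings of the foreground class (e.g. "dog") and a
# generated background phrase (e.g. "grass, trees"); in practice these would come
# from a CLIP text encoder.
fg_text = F.normalize(torch.randn(D), dim=0)
bg_text = F.normalize(torch.randn(D), dim=0)

# Stand-in for query visual features projected into the same embedding space.
query_feats = F.normalize(torch.randn(D, H, W), dim=0)

# Activation maps: cosine similarity between each text embedding and every
# spatial location of the query features.
fg_map = torch.einsum("d,dhw->hw", fg_text, query_feats)
bg_map = torch.einsum("d,dhw->hw", bg_text, query_feats)

# A soft foreground prior that also exploits background semantics: locations
# more similar to the background text are suppressed.
prior = torch.softmax(torch.stack([fg_map, bg_map]), dim=0)[0]
print(prior.shape)  # torch.Size([32, 32])
```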