PRNet: A Progressive Refinement Network for referring image segmentation

Published: 2025 · Last Modified: 09 Nov 2025 · Neurocomputing 2025 · CC BY-SA 4.0
Abstract: Effective feature alignment between language and image is necessary for correctly inferring the location of the referred instance in the referring image segmentation (RIS) task. Previous studies usually assist target localization with external detectors, or apply a coarse-grained positional prior during multimodal feature fusion to implicitly strengthen modal alignment. However, these approaches are either limited by the performance of the external detector and the design of the matching algorithm, or ignore the fine-grained cues in the referring expression when relying on a coarse-grained prior, which may lead to inaccurate segmentation results. In this paper, we propose a new RIS network, the Progressive Refinement Network (PRNet), which gradually improves the alignment quality between language and image from coarse to fine. The core of PRNet is the Progressive Refinement Localization Scheme (PRLS), which consists of a Coarse Positional Prior Module (CPPM) and a Refined Localization Module (RLM). The CPPM obtains rough prior positional information and the corresponding semantic features by computing a similarity matrix between the sentence and the image. The RLM fuses the visual and language modalities by densely aligning pixels with word features, and exploits the prior positional information produced by the CPPM to strengthen textual semantic understanding, guiding the model to locate the referred instance more accurately. Experimental results show that the proposed PRNet performs well on three public datasets: RefCOCO, RefCOCO+, and RefCOCOg.
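The abstract's description of the CPPM, scoring spatial locations by sentence-image similarity to obtain a coarse localization prior, can be illustrated with a minimal sketch. This is an assumption-laden toy implementation (the function name, cosine-similarity scoring, and softmax normalization are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def coarse_positional_prior(visual_feats, sentence_feat):
    """Toy sketch of a CPPM-style coarse prior: score each spatial
    location by its similarity to the pooled sentence embedding.
    (Illustrative assumptions only, not the paper's exact module.)

    visual_feats:  (H, W, C) per-pixel visual features
    sentence_feat: (C,) pooled sentence embedding
    Returns a (H, W) prior map that sums to 1.
    """
    H, W, C = visual_feats.shape
    v = visual_feats.reshape(-1, C)
    # Cosine similarity between each pixel feature and the sentence vector.
    v_norm = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    s_norm = sentence_feat / (np.linalg.norm(sentence_feat) + 1e-8)
    sim = v_norm @ s_norm                      # (H*W,) similarity scores
    # Softmax over spatial positions yields a coarse localization prior
    # that a refinement stage (like the RLM) could then sharpen.
    prior = np.exp(sim - sim.max())
    prior = prior / prior.sum()
    return prior.reshape(H, W)
```

In a full model this prior map would modulate the dense pixel-word fusion in the refinement stage; here it simply highlights the pixels most similar to the sentence embedding.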