Multimodal Weakly Supervised Segmentation for Histopathology

Published: 2025, Last Modified: 21 Jan 2026, Venue: ISBI 2025, License: CC BY-SA 4.0
Abstract: Weakly supervised tissue segmentation methods usually generate masks from Class Activation Maps (CAMs) trained on image-level labels, but CAMs often highlight only the most discriminative regions. Some approaches have tried to improve segmentation performance by incorporating textual knowledge from morphological descriptions. However, because detailed text annotations for each image are costly, such methods rely on generic domain knowledge, which can contain information irrelevant to an individual image. To address this, we propose a token-based method that selectively learns the words most relevant to the segmentation objects while ignoring irrelevant text. We divide the image into patches and use CLIP's image-text matching capability to compute the similarity between each patch and all text labels, generating a coarse segmentation mask. To narrow the gap between the patch level and the image level, we further consider the correspondence between the segmentation object and the text labels from a global perspective and incorporate it into the loss function. The method outperforms all benchmarked approaches on the LUAD-HistoSeg and BCSS-WSSS datasets.
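The patch-to-label matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_similarity`, `coarse_mask`) are hypothetical, the embeddings are small hand-written vectors standing in for what a CLIP image encoder and text encoder would produce, and the hard argmax assignment is an assumption made for illustration. The abstract states only that patch-text similarity is used to generate a coarse mask.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def coarse_mask(patch_embeddings, text_embeddings):
    """Assign each patch the index of its most similar text label.

    patch_embeddings: grid (rows x cols) of embedding vectors, one per patch.
    text_embeddings:  list of embedding vectors, one per text label.
    Returns a grid of label indices with the same rows x cols layout.
    Assumption: hard argmax assignment, chosen here only for illustration.
    """
    mask = []
    for row in patch_embeddings:
        mask_row = []
        for patch in row:
            sims = [cosine_similarity(patch, t) for t in text_embeddings]
            # Index of the label with the highest similarity to this patch.
            mask_row.append(max(range(len(sims)), key=sims.__getitem__))
        mask.append(mask_row)
    return mask


# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP features).
text_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
patches = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.0, 0.2, 0.7], [0.6, 0.3, 0.1]],
]
print(coarse_mask(patches, text_labels))
```

In a real pipeline the toy vectors would be replaced by per-patch image features and per-label text features from a pretrained CLIP model, and the resulting patch grid would be upsampled to image resolution.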