Lightweight Cross-text-vision Prompting Diffusion Network for Medical Image Segmentation

18 Sept 2025 (modified: 12 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Medical Image Segmentation, Robust Feature Learning, Multimodal Fusion, Cross-text-vision Prompt
TL;DR: We develop a lightweight cross-text-vision prompting diffusion network (LCPDN) to improve medical image segmentation accuracy and robustness.
Abstract: Accurate segmentation of anatomical and pathological structures is fundamental for reliable medical image analysis. Recently, UNet architectures have achieved remarkable performance in medical image segmentation. However, several challenges remain: (1) segmentation masks produced by UNet lack fine-grained detail, degrading segmentation quality; (2) UNet facilitates multi-scale feature fusion, yet the absence of explicit semantic prompts leads to imprecise boundary predictions. To address these issues, we construct a lightweight cross-text-vision prompting diffusion network (LCPDN) to improve medical image segmentation accuracy and robustness. Specifically, we develop a cross-text-vision prompting feature learning (CPRFL) module that enables diffusion models to capture fine-grained representations guided by aligned visual and textual information. To further enhance performance, a lightweight text-vision fusion representation (LTFR) module is designed to efficiently integrate visual features with diagnostic knowledge. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art (SOTA) performance with better generalization, particularly under low-data or noisy conditions, highlighting its potential for medical image segmentation tasks. The code is publicly available at https://anonymous.4open.science/r/segmentation.
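The abstract does not specify how the LTFR module integrates visual features with textual diagnostic knowledge; a common lightweight design for this kind of text-vision fusion is a single cross-attention layer in which visual tokens attend to text-prompt embeddings. The sketch below is an illustrative assumption in PyTorch, not the paper's implementation; the class name `LightweightTextVisionFusion` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class LightweightTextVisionFusion(nn.Module):
    """Hypothetical sketch of an LTFR-style block: visual tokens attend to
    text-prompt embeddings via one cross-attention layer, followed by a
    residual feed-forward refinement. All names and sizes are assumptions."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N_vis, dim) flattened visual feature tokens
        # txt: (B, N_txt, dim) text-prompt embeddings (e.g. diagnostic phrases)
        fused, _ = self.attn(query=vis, key=txt, value=txt)  # cross-attention
        vis = self.norm(vis + fused)       # residual connection + normalization
        return vis + self.mlp(vis)         # lightweight feed-forward refinement


# Example: fuse a 4x4 visual feature map (16 tokens) with 8 text tokens.
vis = torch.randn(2, 16, 64)
txt = torch.randn(2, 8, 64)
out = LightweightTextVisionFusion()(vis, txt)
print(out.shape)  # torch.Size([2, 16, 64]) — output keeps the visual shape
```

The residual structure means the fused output stays the same shape as the visual input, so such a block could be dropped into any UNet decoder stage without changing the surrounding architecture.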
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10952