Abstract: Recent advances in medical image segmentation have leveraged multi-modal learning, incorporating textual descriptions to enhance segmentation accuracy. However, existing approaches suffer from high computational costs and inefficient text-vision fusion mechanisms, motivating a solution that is both more accurate and computationally efficient. To address this, we propose ViTexNet, a novel vision-language segmentation model that introduces Text-Guided Dynamic Convolution (TGDC) for effective and lightweight fusion of medical visual features and textual cues. Unlike standard cross-attention mechanisms, which impose high parameter complexity, TGDC dynamically refines image features by leveraging relevant textual semantics at each decoder stage, ensuring efficient feature modulation without excessive overhead. By adaptively emphasizing clinically significant regions based on textual descriptions, TGDC enhances segmentation performance while maintaining computational efficiency. Extensive evaluations on the QaTa-COV19 and MosMedData+ datasets demonstrate ViTexNet’s state-of-the-art performance, achieving 90.76% Dice and 83.25% mIoU on QaTa-COV19, and 78.19% Dice and 64.04% mIoU on MosMedData+, while operating at just 11.5G FLOPs, substantially lower than competing models. Ablation studies confirm TGDC’s superiority over cross-attention-based methods, highlighting its effectiveness in improving segmentation accuracy without computational trade-offs. The source code is publicly available at: https://github.com/bhardwaj-rahul-rb/vitexnet.
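The abstract describes TGDC only at a high level: a text embedding conditions a convolution that modulates decoder features, avoiding the parameter cost of cross-attention. Below is a minimal, hypothetical PyTorch sketch of one way such a text-guided dynamic convolution could be realized; the module name, kernel-generation scheme (a linear layer producing per-sample depthwise filters), and residual formulation are all illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGDCSketch(nn.Module):
    """Illustrative text-guided dynamic convolution (assumed design,
    not the published ViTexNet implementation).

    A pooled text embedding generates one k x k depthwise filter per
    image channel, so the convolution applied to decoder features is
    conditioned on the textual description of the input.
    """
    def __init__(self, channels: int, text_dim: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Lightweight kernel generator: text embedding -> C depthwise filters.
        self.kernel_gen = nn.Linear(text_dim, channels * kernel_size ** 2)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) decoder features; text_emb: (B, text_dim)
        b, c, h, w = feat.shape
        k = self.kernel_size
        # Per-sample, per-channel filters of shape (B*C, 1, k, k).
        kernels = self.kernel_gen(text_emb).view(b * c, 1, k, k)
        # Grouped conv trick: fold the batch into channels so each sample
        # is convolved with its own text-conditioned filters.
        out = F.conv2d(feat.reshape(1, b * c, h, w), kernels,
                       padding=k // 2, groups=b * c)
        # Residual modulation keeps the original features as a baseline.
        return feat + out.view(b, c, h, w)
```

Compared with cross-attention, whose cost scales with the product of visual and textual token counts, this style of dynamic convolution touches each spatial location once with a small generated filter, which is consistent with the low-FLOP claim in the abstract.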
External IDs: dblp:conf/miccai/BhardwajTN25