MMY-Net: a multimodal network exploiting image and patient metadata for simultaneous segmentation and diagnosis

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · Multimedia Systems 2024 · CC BY-SA 4.0
Abstract: Accurate medical image segmentation can effectively assist disease diagnosis and treatment. While neural networks are widely used to solve the segmentation problem in computer-aided diagnosis, patient metadata is usually neglected. Motivated by this, we propose a medical image segmentation and diagnosis framework that exploits both the image and the patient's metadata, such as gender and age. We present MMY-Net, a multimodal network for simultaneous tumor segmentation and diagnosis that incorporates patient metadata. The architecture consists of three parts: a visual encoder, a text encoder, and a decoder with a self-attention block. Specifically, we design a text preprocessing block to embed the metadata effectively, and the image features and text embeddings are then fused at several layers between the two encoders. Moreover, an Interlaced Sparse Self-Attention block is added to the decoder to further boost performance. We evaluate our algorithm on one private dataset (ZJU2) and, for zero-shot validation, on a second private dataset (LISHUI). Results show that our algorithm with metadata outperforms its counterpart without metadata by a large margin for basal cell carcinoma segmentation (a 14.3% improvement in IoU and 8.5% in Dice on the ZJU2 dataset, and a 7.1% IoU improvement on the LISHUI validation dataset). Additionally, we apply MMY-Net to a public segmentation dataset, GlaS, to demonstrate its general segmentation capability, where it outperforms state-of-the-art methods.
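To make the fusion idea concrete, below is a minimal sketch (not the authors' implementation) of how patient metadata such as gender and age can be embedded and injected into image features at several encoder stages, with a self-attention block in the decoder path. All module names, layer sizes, and the use of standard multi-head attention in place of Interlaced Sparse Self-Attention are illustrative assumptions.

```python
# Illustrative sketch only: hypothetical module names and sizes; standard
# multi-head attention stands in for the paper's Interlaced Sparse
# Self-Attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetadataEmbedding(nn.Module):
    """Embeds categorical gender and a normalized scalar age into one vector."""
    def __init__(self, dim=64):
        super().__init__()
        self.gender = nn.Embedding(2, dim)   # 0 = female, 1 = male
        self.age = nn.Linear(1, dim)         # age normalized to [0, 1]
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, gender, age):
        g = self.gender(gender)                      # (B, dim)
        a = self.age(age.unsqueeze(-1))              # (B, dim)
        return self.proj(torch.cat([g, a], dim=-1))  # (B, dim)

class FusionBlock(nn.Module):
    """Injects the metadata embedding into an image feature map as a channel bias."""
    def __init__(self, channels, meta_dim=64):
        super().__init__()
        self.to_channels = nn.Linear(meta_dim, channels)

    def forward(self, feat, meta):
        bias = self.to_channels(meta)[..., None, None]  # (B, C, 1, 1)
        return feat + bias                              # broadcast over H, W

class TinyMultimodalSegNet(nn.Module):
    def __init__(self, meta_dim=64):
        super().__init__()
        self.meta = MetadataEmbedding(meta_dim)
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.fuse1 = FusionBlock(32, meta_dim)
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.fuse2 = FusionBlock(64, meta_dim)
        self.attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
        self.head = nn.Conv2d(64, 1, 1)      # binary tumor mask logits

    def forward(self, image, gender, age):
        m = self.meta(gender, age)
        x = self.fuse1(self.enc1(image), m)  # fuse metadata at stage 1
        x = self.fuse2(self.enc2(x), m)      # and again at stage 2
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)   # (B, H*W, C) for attention
        seq, _ = self.attn(seq, seq, seq)
        x = seq.transpose(1, 2).reshape(b, c, h, w)
        logits = self.head(x)
        return F.interpolate(logits, scale_factor=4, mode="bilinear")

# Example forward pass with a dummy batch.
net = TinyMultimodalSegNet()
img = torch.randn(2, 3, 128, 128)
gender = torch.tensor([0, 1])
age = torch.tensor([0.35, 0.62])
mask_logits = net(img, gender, age)          # (2, 1, 128, 128)
```

The key design point illustrated here is that the metadata branch conditions the visual features at multiple depths rather than only at the final classifier, which is the mechanism the abstract credits for the segmentation gains.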