Integrating Multi-modal Contrastive Learning and Multi-scale Feature Extractor for Liver Cancer Classification
Abstract: The classification of liver cancer into primary and metastatic types is a crucial task in computer-aided diagnosis. However, existing leading neural network-based methods operate as black boxes and lack explainability. In this paper, we propose a multi-modal alignment framework for liver cancer classification, which efficiently learns the relationship between specific textual and visual features. During the feature extraction stage, we introduce a multi-scale extractor network to obtain fine-grained visual features, and we use a pre-trained language model to encode textual information derived from specialized medical expertise. In the multi-modal alignment stage, multi-modal contrastive learning—combining a contrastive loss and a match loss—enables the framework to accurately classify liver cancer types while providing relevant medical explanations. Empirical evaluation on a curated dataset of CT images demonstrates that our framework outperforms state-of-the-art methods. We will release the source code after acceptance.
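The paper's code is not yet released, so the exact loss formulation is unavailable. As a rough illustration of the contrastive component described above, the sketch below implements a standard symmetric InfoNCE contrastive loss over paired image/text embeddings (the usual formulation in multi-modal alignment work); the function name, temperature value, and NumPy implementation are assumptions, not the authors' code, and the accompanying match loss (typically a binary matched/unmatched classifier on fused features) is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: arrays of shape (B, D); row i of each is a matched pair.
    NOTE: illustrative sketch only -- not the paper's released implementation.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature        # (B, B); matched pairs on the diagonal
    idx = np.arange(logits.shape[0])

    def xent_diag(l):
        # Cross-entropy where the correct "class" for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero, while random embeddings give a loss near log(B), which is the usual sanity check for this objective.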
External IDs: dblp:conf/swarm/ChenCWXY25