Integrating Multi-modal Contrastive Learning and Multi-scale Feature Extractor for Liver Cancer Classification

Published: 2025 · Last Modified: 23 Jan 2026 · ICSI (2) 2025 · CC BY-SA 4.0
Abstract: The classification of liver cancer into primary and metastatic types is a crucial task in computer-aided diagnosis. However, existing leading neural-network-based methods operate as black boxes and lack explainability. In this paper, we propose a multi-modal alignment framework for liver cancer classification that efficiently learns the relationship between specific textual and visual features. In the feature extraction stage, we introduce a multi-scale extractor network to obtain fine-grained visual features, and use a pre-trained language model to encode textual information derived from specialized expertise. In the multi-modal alignment stage, multi-modal contrastive learning—combining a contrastive loss and a match loss—enables the framework to accurately classify liver cancer types while providing relevant medical explanations. Empirical evaluation on a curated dataset of CT images demonstrates that our framework outperforms state-of-the-art methods. We will release the source code after acceptance.
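The abstract describes alignment via two objectives: a contrastive loss pulling matched image–text pairs together, and a match loss scoring whether a pair is aligned. The paper does not give its exact formulation, so the sketch below is a generic, hypothetical version of these two losses (a symmetric InfoNCE contrastive term, as popularized by CLIP, plus a binary cross-entropy match term) in NumPy; the function names, temperature value, and the use of cosine similarity are assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE: the i-th image and i-th text form the positive pair;
    # all other pairings in the batch serve as negatives.
    # (Hypothetical formulation; the paper's exact loss is not specified.)
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(img))

    def xent(lg):
        # Numerically stable cross-entropy with diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def match_loss(scores, is_match):
    # Binary cross-entropy on a matching head's raw scores:
    # is_match = 1 for aligned image-text pairs, 0 for mismatched ones.
    p = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(is_match * np.log(p + 1e-8)
                    + (1.0 - is_match) * np.log(1.0 - p + 1e-8))
```

Under this formulation, a batch of correctly paired embeddings yields a lower contrastive loss than a shuffled batch, which is the signal the alignment stage would optimize.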