Classifying Cervical OCT Images using Masked Autoencoders with VMamba

Qingbin Wang, Yuchen Pei, Jian Wang, Yutao Ma

Published: 2024, Last Modified: 23 Jan 2026, BIBM 2024, CC BY-SA 4.0

Abstract: Optical coherence tomography (OCT) has emerged as a promising non-invasive, real-time, high-resolution imaging technology for cervical cancer detection. However, the scarcity of high-quality annotated cervical OCT images severely hinders the predictive performance of deep learning models. Self-supervised learning (SSL) methods offer a viable solution to this challenge. This study aims to develop a cost-effective SSL framework that efficiently classifies high-resolution cervical OCT images to meet the clinical "see-and-diagnose" requirement for cervical lesions. To this end, we propose COVE, a novel SSL framework with masked autoencoders based on VMamba, which has linear complexity. To our knowledge, COVE is the first SSL framework to use masked image modeling with VMamba to leverage large unlabeled cervical OCT datasets. It has two specific designs: (1) 2D-Selective-Scan for visible patches (VP-SS2D): the VMamba encoder’s core module processes only visible patches, which makes pre-training efficient and addresses the inconsistency between pre-training and fine-tuning; (2) visible patch feature-preserving (VPFP): visible patch features in each decoder block are replaced with encoder-extracted features, decoupling the encoder’s feature extraction from the decoder’s pixel reconstruction task. Experimental results showed that COVE outperformed existing SSL frameworks in five-fold cross-validation on a 1,452-subject cervical OCT dataset from a multi-center study and on two external validation sets from top Chinese hospitals. Additionally, COVE demonstrated the highest pre-training efficiency, pre-training significantly faster than existing SSL frameworks.
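
The two designs named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation and does not model VMamba's 2D selective scan. The function names `random_mask` and `preserve_visible_features`, the 75% mask ratio, the patch count, and the toy stand-ins for the encoder and decoder block are all assumptions made for illustration. The sketch only shows the bookkeeping: the encoder sees visible patches alone, and after a decoder block the visible positions are overwritten with the encoder's features.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into sorted visible and masked sets (hypothetical helper)."""
    num_visible = int(round(num_patches * (1.0 - mask_ratio)))
    perm = rng.permutation(num_patches)
    visible = np.sort(perm[:num_visible])
    masked = np.sort(perm[num_visible:])
    return visible, masked


def preserve_visible_features(decoder_tokens, encoder_features, visible_idx):
    """Overwrite visible positions with encoder features (VPFP-style, assumed form)."""
    out = decoder_tokens.copy()
    out[visible_idx] = encoder_features
    return out


rng = np.random.default_rng(0)
num_patches, dim = 16, 8                      # assumed toy sizes
patches = rng.normal(size=(num_patches, dim))

# Assumed 75% mask ratio: only the remaining patches reach the encoder.
visible_idx, masked_idx = random_mask(num_patches, 0.75, rng)

# Stand-in encoder: any function applied to the visible patches only.
encoder_features = patches[visible_idx] * 2.0

# Decoder input: zero-valued mask tokens everywhere, encoder features at visible positions.
decoder_tokens = np.zeros((num_patches, dim))
decoder_tokens[visible_idx] = encoder_features

# Stand-in decoder block: it changes every token, including the visible ones.
decoder_tokens = decoder_tokens + 1.0

# Restore the encoder's features at the visible positions after the block.
restored = preserve_visible_features(decoder_tokens, encoder_features, visible_idx)
```

In this toy setup, the decoder block's changes survive only at the masked positions, which is one way to read the abstract's claim that the decoder's reconstruction work is decoupled from the encoder's feature extraction.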

External IDs: dblp:conf/bibm/WangPWM24