Keywords: Self-supervised Learning, Colonoscopic Image Analysis, MultiView Learning, Joint Embedding Architectures, Side Information
TL;DR: Identifies a failure mode of the MultiView assumption in SSL, especially prevalent in medical domains such as colonoscopy, and proposes a framework using otherwise redundant side information.
Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for training visual encoders, and is of particular interest in medical imaging where acquiring large-scale labelled datasets is both costly and difficult. Colonoscopy is a prime example, playing a critical role in the early detection and classification of colorectal polyps. In this work, we explore the role of side information — images from the colon with no polyps — as an auxiliary signal alongside the positive context (images with polyps) during SSL pretraining. While traditional SSL methods rely on instance discrimination or contrastive similarity between augmented views, we hypothesize that leveraging non-polyp context can enhance the model’s ability to learn discriminative features for downstream tasks. To achieve this, we reformulate the MultiView assumption underlying many successful SSL frameworks such as DINO and Masked Siamese Networks, and operationalize this reformulation through a Jensen–Shannon divergence loss that explicitly disentangles nuisance from task-relevant features. Applied to colonoscopy pretraining, this approach yields improved performance on the clinically important task of distinguishing adenomatous (precancerous) from hyperplastic (benign) polyps. These findings highlight the value of leveraging structural cues in unlabelled data during SSL pretraining and suggest a promising direction for representation learning in colonoscopic imaging and beyond.
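The abstract describes a Jensen–Shannon divergence loss but does not include an implementation. As a rough illustration of the divergence term itself, here is a minimal NumPy sketch; the function names, the softmax "assignment" setup, and the polyp/side-information pairing are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # Kullback-Leibler divergence between two discrete distributions
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetrised KL against the mixture m,
    # bounded above by ln(2) for discrete distributions
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy example: hypothetical soft assignment distributions for a polyp view
# and a non-polyp (side-information) view; a loss like the one described
# could, e.g., drive these apart to separate nuisance from task features.
polyp_view = softmax(np.array([2.0, 0.5, 0.1]))
side_view = softmax(np.array([0.1, 0.5, 2.0]))
print(js_divergence(polyp_view, side_view))
```

The divergence is symmetric and bounded, which is one reason it is often preferred over raw KL as a training objective between two distributions.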
Submission Number: 20