Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. ICML 2026 regular. CC BY 4.0
TL;DR: We introduce a topology-aware self-supervised pre-training framework that exploits consistent anatomical spatial relationships across individuals and modalities.
Abstract: Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.
Lay Summary: Self-supervised learning (SSL) aims to learn robust representations that serve as effective initializations for various downstream tasks, e.g., classification and segmentation. Previous SSL methods for medical imaging mainly exploit instance-level self-supervision by augmenting images or reconstructing masked patches. However, these strategies often overlook the topological consistency shared across individuals, limiting their ability to capture population-level anatomical representations. In this paper, we propose to leverage cross-instance anatomical topology as a supervisory signal for multi-modal medical image pre-training. Specifically, we design intra-instance and inter-instance alignment objectives to preserve local neighborhood relationships across modalities and individuals, enabling topology-aware representation learning under anatomical variability. Experiments on seven downstream tasks demonstrate consistent improvements in segmentation and classification, as well as enhanced robustness to missing modalities at test time.
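Illustrative sketch: the intra-instance regime described above pairs a cross-modal triplet objective with known pixel-level correspondences. The snippet below is a minimal, generic triplet loss under assumptions that are not taken from the paper: features are already sampled at N corresponding locations in two modalities, distances are squared Euclidean on L2-normalised vectors, and the names `cross_modal_triplet_loss`, `neg_index`, and `margin` are hypothetical. It is not the authors' implementation; see the linked repository for that.

```python
import numpy as np


def cross_modal_triplet_loss(feat_a, feat_b, neg_index, margin=0.2):
    """Generic cross-modal triplet loss (hypothetical helper, not from the paper).

    feat_a, feat_b: (N, D) arrays of features sampled at the same N spatial
        locations in modality A and modality B (row i corresponds to row i).
    neg_index: (N,) integer array; neg_index[i] selects a row of feat_b at a
        non-corresponding location to act as the negative for anchor i.
    margin: assumed hinge margin.
    """
    eps = 1e-12  # guards against division by zero for all-zero features
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + eps)

    # Anchor in modality A vs. the same location in modality B (positive).
    d_pos = np.sum((a - b) ** 2, axis=1)
    # Anchor in modality A vs. a different location in modality B (negative).
    d_neg = np.sum((a - b[neg_index]) ** 2, axis=1)

    # Hinge: the positive should be closer than the negative by at least margin.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Toy usage: four locations with orthogonal features.
features = np.eye(4)
shifted = np.array([1, 2, 3, 0])  # negative = the next location

aligned_loss = cross_modal_triplet_loss(features, features, shifted)
misaligned_loss = cross_modal_triplet_loss(
    features, np.roll(features, 1, axis=0), shifted
)
```

In this toy case the loss vanishes when the two modalities agree at every location and grows when the features are spatially shifted against each other, which is the behaviour a correspondence-preserving objective is meant to penalise.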
Link To Code: https://github.com/Ashespt/TACO
Primary Area: Applications->Health / Medicine
Keywords: medical image analysis, self-supervised learning, 3D medical imaging
Submission Number: 19869