\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{graphicx}
\usepackage{amsmath} % math environments
\usepackage{hyperref}
\usepackage[table,xcdraw]{xcolor}
\usepackage{colortbl}
\setlength{\abovecaptionskip}{1pt}   % Adjust space above caption
\usepackage[skip=5pt]{caption}  % Reduce space above and below captions
\usepackage{soul}

\setlength{\belowcaptionskip}{1pt}    % Adjust space below caption
\usepackage{float}



\jmlryear{2025}
\jmlrworkshop{Full Paper -- MIDL 2025}
\jmlrvolume{-- nnn}
\editors{Accepted for publication at MIDL 2025}

\title[3D Self-Supervised Learning for Medical Imaging]{Advancing Medical Image Segmentation with Self-Supervised Learning: A 3D Student-Teacher Approach for Cardiac and Neurological Imaging}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Moona Mazher\midljointauthortext{Corresponding Author}\nametag{$^{,1}$}} \orcid{0000-0003-4444-5776} \Email{m.mazher@ucl.ac.uk}\\
\Name{Daniel C. Alexander\midljointauthor\nametag{$^{1}$}} \Email{d.alexander@ucl.ac.uk}\\
\Name{Abdul Qayyum\midljointauthor\nametag{$^{2,3}$}} \Email{a.qayyum@imperial.ac.uk}\\
\Name{Steven A Niederer\midljointauthor\nametag{$^{2,3}$}} \Email{s.niederer@imperial.ac.uk}\\
\addr $^{1}$ Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom \\
\addr $^{2}$ National Heart and Lung Institute, Faculty of Medicine, Imperial College London, London, United Kingdom \\
\addr $^{3}$ The Alan Turing Institute, London, United Kingdom \\
%\Name{Author Name2\midlotherjointauthor\nametag{$^{1}$}} \Email{xyz@sample.edu}\\
%\Name{Author Name3\nametag{$^{2}$}} \Email{alphabeta@example.edu}\\
%\Name{Author Name4\midljointauthortext{Contributed equally}\nametag{$^{3}$}} %\Email{uvw@foo.ac.uk}\\
%\addr $^{3}$ Address 3 \AND
%\Name{Author Name5\midlotherjointauthor\nametag{$^{4}$}} \Email{fgh@bar.com}\\
%\addr $^{4}$ Address 4
}

\begin{document}

\maketitle

\begin{abstract}


We propose 3D-SegSync, a self-supervised learning (SSL) framework designed to improve segmentation accuracy for both cardiac and neurological structures. It integrates a student-teacher model with a 3D Vision-LSTM (xLSTM) backbone to capture spatial dependencies in volumetric data. The SSL phase uses large-scale unlabeled datasets for pretraining, followed by fine-tuning on labeled data to improve segmentation across CT and MRI scans. Experimental results show that 3D-SegSync achieves consistent performance across different anatomical structures. Its ability to generalize between CT and MRI without modality-specific modifications highlights its adaptability for cardiac and neurological image segmentation and suggests that it could be extended to other medical image segmentation tasks. Code is available at \url{https://github.com/Moona-Mazher/3D-SegSync_SSL}.



\end{abstract}

\begin{keywords}
Self-Supervised Learning (SSL), Whole Heart Segmentation (WHS), Ischemic Stroke Lesion Segmentation (ISLES), CT Imaging, MRI Imaging, Cardiac Imaging, Neurological Imaging, xLSTM, Multi-Modal Imaging, Traumatic Brain Injury (TBI).
\end{keywords}

\section{Introduction}
Medical image segmentation is critical for accurate diagnosis, treatment planning, and monitoring of disease progression, especially in complex 3D tasks such as cardiac and neurological imaging. However, segmentation in these areas remains challenging because of limited annotated data, modality variability, and suboptimal image quality. These difficulties are particularly evident in cardiac and brain imaging, where anatomical complexity, patient variability, and motion artifacts compound the problem.

\textbf{Challenges in Cardiac and Neurological Imaging:} In cardiac imaging, accurate segmentation of structures such as the ventricles, atria, myocardium, and blood vessels is essential for diagnosing heart disease. However, dynamic shape changes across the cardiac cycle, modality variability (CT vs. MRI), and motion artifacts make segmentation difficult. Existing methods, such as those by \cite{zhuang2016multi} and \cite{isensee2019}, often struggle with multi-center datasets and modality generalization.

In neurological imaging, accurate segmentation of ischemic stroke lesions is critical for prognosis and treatment. Although MRI is commonly used, variability in brain anatomy, lesion complexity, and imaging artifacts present substantial challenges. Methods such as those by \cite{menze2015} and \cite{kohl2020} have demonstrated robust performance but tend to rely on large annotated datasets and struggle to generalize across clinical settings and modalities.

\textbf{Gaps in Existing Approaches:} (1) Limited 3D modeling: SSL methods such as SimCLR \citep{chen2020} and MoCo \citep{he2020} have been successful in 2D tasks but do not capture the long-range spatial dependencies of volumetric medical images. Recent 3D SSL models such as SwinMM \citep{wang2023swinmm}, SwinSSL \citep{tang2022self}, VoCo \citep{wu2024voco}, and Hi-End-MAE \citep{tang2025hi} operate on volumetric data directly. However, these models still face limitations in fully adapting to the 3D nature of medical data: they often focus on learning low-level features from local patches and may not explicitly model long-range dependencies between slices, which are crucial for accurate segmentation across the entire volume.
(2) Modality-specific limitations: Many existing models are optimized for a specific imaging modality (CT or MRI) and struggle to generalize across modalities, leading to reduced performance in multi-modal settings \citep{ronneberger2015,zhu2021}.
(3) Dependence on labeled data: Despite the promise of SSL, most methods still require substantial labeled datasets for fine-tuning, which remains a bottleneck in medical imaging because of the cost and time involved in manual annotation.

\textbf{Contribution:} Our study integrates 3D self-supervised learning (SSL) pretraining with xLSTM for medical image segmentation, targeting large-scale CT and MRI datasets in cardiology and neurology. Inspired by the DINOv2 student-teacher framework, we extend its principles from 2D to 3D by replacing the Vision Transformer (ViT) with a 3D xLSTM-based encoder. While state-of-the-art SSL models such as DINOv2, MAE, SwinMM, and SwinSSL rely on ViTs for feature learning, our approach uses xLSTM to capture long-range spatial dependencies across slices, which makes it better suited to volumetric medical imaging. The model is pretrained on large unlabeled datasets and fine-tuned on smaller labeled ones, improving feature representation and anatomical structure learning for cardiac and neurological segmentation.

Our SSL-based framework helps overcome key challenges in cardiac and neurological imaging, including limited labeled data and cross-modality segmentation. By improving feature learning and spatial dependency modeling, our approach enhances segmentation performance and adaptability. This work has the potential to support clinical decision-making and improve patient outcomes in the future.

\section{Proposed Method}

\subsection{Dataset}

We curated and preprocessed whole-heart CT/MRI and brain MRI datasets for self-supervised learning (SSL) and segmentation tasks. For whole-heart segmentation, we used CT Coronary Angiography (CTCA) \citep{gharleghi2022automated} from the Coronary Atlas, ImageCAS (1,000 patients) \citep{zeng2023imagecas}, ImageTBAD (56 CT angiography images) \citep{radl2022avt}, and TotalSegmentator (1,204 CT scans) \citep{wasserthal2023totalsegmentator}. The unlabeled validation sets of the MMWHS \citep{zhuang2019evaluation} and WHS++ \citep{zhuang2016multi} challenges (whose labels are held by the challenge organizers to evaluate participating teams) were included for SSL pretraining, while the labeled training sets from MMWHS, WHS++, and HVSMR-2.0 \citep{pace2024hvsmr} were used for fine-tuning. For brain imaging SSL pretraining, we used the ISLES datasets, including ISLES 2022 (400 MRI cases) \citep{de2024robust} and previous versions (ISLES 2015, 2016, 2018), along with ATLAS \citep{liew2022large} (304 cases in v1.2, 1,271 in v2.0). The model was fine-tuned on ISLES 2024 for stroke lesion segmentation and on the TBI dataset for traumatic brain injury (TBI) lesion segmentation. The data were used in three phases: (1) pretraining on large unlabeled datasets (cardiac: CT/MRI; brain: MRI) for general feature learning; (2) fine-tuning on labeled datasets for heart (HVSMR-2.0, MMWHS-CT, WHS++ CT, MMWHS-MRI, WHS++ MRI) and brain (ISLES 2024, TBI) segmentation, with each dataset split into 80\% for training and 20\% for testing; and (3) testing on the remaining 20\% of labeled data for both heart and brain segmentation tasks to assess performance after pretraining and fine-tuning.



\subsection{Proposed Framework for SSL Pretraining and Fine-tuning}
\figureref{fig:Picture4} presents the overall workflow of the proposed model for whole heart and brain lesion segmentation. The framework comprises three primary sections:

\subsubsection{Proposed 3D SSL Student-Teacher Model}

A 3D student-teacher model, inspired by the 2D DINOv2 framework \citep{oquab2023dinov2}, is built on the xLSTM-UNet architecture for the SSL phase. We pretrained separate SSL models for cardiac and brain images and then fine-tuned them for the corresponding segmentation tasks. Unlike a Vision Transformer (ViT), our model uses xLSTM to model slice-to-slice relationships in 3D medical images; capturing these long-range dependencies promotes the spatial coherence needed for accurate cardiac and brain segmentation. The student encoder is optimized via backpropagation, while the teacher encoder is updated through a momentum-based exponential moving average (EMA) mechanism. Contrastive self-distillation encourages the student to match the teacher's representations, and a hybrid loss function combining KL divergence and mean squared error (MSE) enhances feature learning. The detailed methodology is provided in Appendix~C.
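
As a minimal sketch of the two update rules described above, consider the following formulation. The notation is introduced here for illustration only, and the exact form used in the implementation may differ. Let $\theta_s$ and $\theta_t$ denote the student and teacher parameters, and let $p_s, p_t$ and $z_s, z_t \in \mathbb{R}^{d}$ denote their output distributions and feature embeddings for two augmented views of the same volume. With an assumed momentum coefficient $m \in [0,1)$ and an assumed weighting $\lambda \geq 0$ (neither value is specified in this section), a standard EMA update and a hybrid KL/MSE objective would read
\begin{align*}
\theta_t &\leftarrow m\,\theta_t + (1-m)\,\theta_s, \\
\mathcal{L}_{\mathrm{SSL}} &= \mathrm{KL}\bigl(p_t \,\|\, p_s\bigr) + \frac{\lambda}{d}\,\lVert z_s - z_t \rVert_2^2 .
\end{align*}
In this sketch, gradients of $\mathcal{L}_{\mathrm{SSL}}$ are taken with respect to $\theta_s$ only, and the teacher outputs act as fixed targets within each step, consistent with the teacher being updated by EMA and not by backpropagation.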

\subsubsection{The xLSTM Module}

The xLSTM block integrates convolutional processing with a matrix LSTM (mLSTM) for enhanced feature extraction and sequential modeling. It starts with a convolutional layer, instance normalization (IN), and Leaky ReLU activation to capture spatial patterns and stabilize training. The output is flattened, normalized, and split into two pathways: one undergoes a linear transformation with SiLU activation, while the other undergoes a flip operation before mLSTM processing to capture long-range dependencies. The pathways are merged, followed by a final linear transformation and a residual connection to preserve information and improve gradient flow. By combining convolutional and recurrent architectures, xLSTM extracts local spatial features while efficiently modeling sequential dependencies. The flip mechanism enables bidirectional processing, ensuring both past and future dependencies are captured, while normalization and residual connections enhance stability and training efficiency.
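
The data flow of the block described above can be summarised as follows. This is a sketch with notation introduced here for illustration only. In particular, the use of an element-wise product $\odot$ to merge the two pathways and the choice of the normalized tokens $Z$ as the residual branch are assumptions, since the text does not specify either. For an input feature map $X$,
\begin{align*}
F &= \mathrm{LeakyReLU}\bigl(\mathrm{IN}(\mathrm{Conv}(X))\bigr), \\
Z &= \mathrm{Norm}\bigl(\mathrm{Flatten}(F)\bigr), \\
A &= \mathrm{SiLU}(W_1 Z), \\
B &= \mathrm{mLSTM}\bigl(\mathrm{Flip}(Z)\bigr), \\
Y &= W_2\,(A \odot B) + Z,
\end{align*}
where $W_1$ and $W_2$ denote the two linear transformations and $Y$ is reshaped back to the volumetric layout of $F$ before being passed to the next stage.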


\subsubsection{Supervised Fine-Tuning for Segmentation}

In this stage, the SSL models pretrained on cardiac and neurological images are fine-tuned with a limited amount of labeled data for the respective segmentation tasks: whole-heart segmentation, stroke lesion segmentation, and traumatic brain injury lesion segmentation. During this phase, the pretrained student encoder is fine-tuned to optimize segmentation performance for each application.

\begin{figure}[h]
    \centering
    \includegraphics[width=1.0\textwidth]{Picture6.png}
    \caption{Overview of the 3D SSL student-teacher pretraining framework and downstream fine-tuning segmentation pipeline, incorporating the xLSTM module.}
    \label{fig:Picture4}
\end{figure}

\subsection{Evaluation and Performance Analysis}	

To evaluate 3D-SegSync, we benchmarked it against its variant 3D-SegSync-Bottom, in which only the bottom layer of the pretrained SSL encoder is updated, and against other state-of-the-art (SOTA) models: 3D-nnUNet \citep{isensee2021nnu}, 3D-UNet \citep{ronneberger2015u}, and 3D-ResUNet \citep{li2023state}. The comparison covers both heart and brain segmentation tasks and is used to assess the accuracy and robustness of the proposed approach. We further extended the comparison to recent SSL methods developed specifically for 3D medical imaging (SwinMM, SwinSSL, VoCo, and Hi-End-MAE) to provide a more comprehensive evaluation of the benefits of our pretraining approach (see Table \ref{3dssl}).
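
For reference, the two main metrics reported in the Results section are commonly defined as follows. These are the usual textbook definitions, given here as a sketch. The evaluation code may differ in details such as the handling of empty masks, and some implementations pool the two directed distance sets before taking the percentile. For a predicted mask $P$ and a ground-truth mask $G$ with surfaces $\partial P$ and $\partial G$,
\begin{align*}
\mathrm{Dice}(P,G) &= \frac{2\,|P \cap G|}{|P| + |G|}, \\
\mathrm{HD95}(P,G) &= \max\Bigl\{ \underset{p \in \partial P}{\mathrm{P}_{95}}\; d(p,\partial G),\; \underset{g \in \partial G}{\mathrm{P}_{95}}\; d(g,\partial P) \Bigr\},
\end{align*}
where $d(x,S) = \min_{s \in S} \lVert x - s \rVert_2$ is the distance from a point to a surface and $\mathrm{P}_{95}$ denotes the 95th percentile. Higher Dice and lower HD95 indicate better agreement with the ground truth.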





\subsection{Training and optimization}

We implemented the self-supervised learning (SSL) framework in PyTorch and optimized it with the Adam optimizer (learning rate $10^{-5}$) for stable convergence. The model is trained with a batch size of 2 for 1000 epochs, using a patch size of $96\times96\times96$ during SSL and $128\times128\times128$ during downstream tasks. SSL uses data augmentation techniques such as random cropping, flipping, color jittering, Gaussian blur, and solarization, with a loss function based on KL divergence and MSE. For downstream tasks, augmentations include flipping, scaling, noise addition, and brightness/contrast adjustments, together with the RandGaussianRotate, RandGaussianSmooth, RandZoomd, RandAdjustContrast, RandGaussianNoise, RandShiftIntensity, and RandCrop transforms. The downstream loss function combines cross-entropy and Dice loss to improve segmentation. Inference uses a sliding-window approach to produce smooth predictions. Training took 15 hours for SSL (with early stopping at 20 epochs) and 24 hours for downstream tasks on an A6000 GPU with 48 GB of memory.
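
The combined downstream objective mentioned above can be sketched as follows. The equal weighting of the two terms, the soft (probabilistic) form of the Dice term, and the smoothing constant $\epsilon$ are assumptions made for illustration, since the text states only that cross-entropy and Dice loss are combined. For $N$ voxels and $C$ classes, with predicted probabilities $p_{i,c}$ and one-hot labels $g_{i,c}$,
\begin{align*}
\mathcal{L}_{\mathrm{CE}} &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} g_{i,c}\,\log p_{i,c}, \\
\mathcal{L}_{\mathrm{Dice}} &= 1 - \frac{1}{C}\sum_{c=1}^{C} \frac{2\sum_{i} p_{i,c}\,g_{i,c} + \epsilon}{\sum_{i} p_{i,c} + \sum_{i} g_{i,c} + \epsilon}, \\
\mathcal{L}_{\mathrm{seg}} &= \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{Dice}} .
\end{align*}
The cross-entropy term provides a voxel-wise gradient signal, while the Dice term directly targets region overlap and is less sensitive to class imbalance, which is relevant for small lesions.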






\section{Results} 
We evaluated the proposed 3D-SegSync model on multiple datasets: three whole-heart datasets (MMWHS, WHS++, and HVSMR-2.0 \citep{pace2024hvsmr}) and two neurological/brain imaging datasets (ISLES-2024 stroke and TBI). The results in \figureref{fig:hd_dice} show that 3D-SegSync consistently outperforms state-of-the-art (SOTA) models, achieving higher Dice scores and lower 95th-percentile Hausdorff distance (HD95) values on all cardiac imaging datasets, across both CT and MRI. 3D-SegSync, which uses multi-layer SSL pretraining, also achieves significantly higher Dice scores and lower HD95 values than 3D-SegSync-Bottom, which uses SSL features from the bottom layer only. Multi-layer feature extraction allows 3D-SegSync to capture richer, hierarchical representations, leading to better segmentation performance.

In \figureref{fig:all_labels_HMSV}, we present detailed segmentation results on the HVSMR-2.0 dataset to show 3D-SegSync's performance across all labels. The model segments anatomical structures such as the left ventricle (LV), aorta (AO), and pulmonary artery (PA) with higher Dice scores than the compared models, even for smaller structures, which indicates that it balances local detail with broader anatomical context. This improvement stems from multi-layer SSL pretraining, which enables 3D-SegSync to learn richer feature representations. Unlike 3D-SegSync-Bottom, which relies on low-level features, the full 3D-SegSync model integrates high-level context and captures fine anatomical boundaries, as indicated by its significantly lower HD95 values. Further analysis of 3D-SegSync's generalization across imaging modalities, together with significance maps, is provided in Appendix~B. Figure \ref{fig:brain_plots} compares 3D-SegSync with the other models (3D-SegSync-Bottom, xLSTM-UNet, 3D-nnUNet, 3D-ResUNet, and 3D-UNet) on the ISLES-2024 stroke and TBI datasets, where 3D-SegSync achieves the highest Dice scores and lowest HD95 values.


\begin{figure}[H]
    \centering
    \includegraphics[width=0.86\textwidth]{hd_dice.png}
    \caption{Performance comparison of the proposed 3D-SegSync and SOTA models on Dice and HD-95 metrics across Whole Heart segmentation datasets.}
    \label{fig:hd_dice}
\end{figure}


\begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{all_labels_HMSV.png}
    \caption{Per-label Dice coefficient of the proposed model and SOTA approaches on the HVSMR-2.0 dataset. The labels include LV, RV, and other anatomical structures.}
    \label{fig:all_labels_HMSV}
\end{figure}






 \begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{brain_plots.png}
    \caption{Performance comparison of the proposed 3D-SegSync and SOTA models on Dice and HD-95 metrics across TBI and ISLES brain lesion segmentation datasets.}
    \label{fig:brain_plots}
\end{figure}




\begin{table}[H]
\centering
\caption{Performance analysis of 3D-SegSync with its variants and SOTA models for all heart and brain imaging datasets.}
\label{overall}
\resizebox{\textwidth}{!}{
\begin{tabular}{|lcccccc|}
\hline
\multicolumn{7}{|c|}{\textbf{Average Dice Coefficient (±SD)}}                                                                                                                                                                                                                                                                              \\ \hline
\multicolumn{1}{|l|}{\textbf{Dataset}} & \multicolumn{1}{c|}{\textbf{3D-SegSync}}                                                                  & \multicolumn{1}{c|}{\textbf{3D-SegSync-Bottom}} & \multicolumn{1}{c|}{\textbf{xLSTM-UNet}} & \multicolumn{1}{c|}{\textbf{3D-nnUNet}} & \multicolumn{1}{c|}{\textbf{3D-ResUNet}} & \textbf{3D-UNet} \\ \hline
\multicolumn{1}{|l|}{HVSMR-2.0}        & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}0.77 ± 0.02\end{tabular}}} & \multicolumn{1}{c|}{0.75 ± 0.03}                & \multicolumn{1}{c|}{0.76 ± 0.04}         & \multicolumn{1}{c|}{0.74 ± 0.05}        & \multicolumn{1}{c|}{0.70 ± 0.05}         & 0.67 ± 0.06      \\ \hline
\multicolumn{1}{|l|}{MMWHS CT}         & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}0.94 ± 0.01  \end{tabular}}}            & \multicolumn{1}{c|}{0.93 ± 0.02}                & \multicolumn{1}{c|}{0.90 ± 0.03}         & \multicolumn{1}{c|}{0.91 ± 0.03}        & \multicolumn{1}{c|}{0.88 ± 0.04}         & 0.85 ± 0.05      \\ \hline
\multicolumn{1}{|l|}{MMWHS MRI}        & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}0.88 ± 0.02\end{tabular}}}            & \multicolumn{1}{c|}{0.86 ± 0.03}                & \multicolumn{1}{c|}{0.85 ± 0.03}         & \multicolumn{1}{c|}{0.84 ± 0.03}        & \multicolumn{1}{c|}{0.81 ± 0.04}         & 0.79 ± 0.05      \\ \hline
\multicolumn{1}{|l|}{WHS++ CT}         & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}0.97 ± 0.01  \end{tabular}}}             & \multicolumn{1}{c|}{0.94 ± 0.02}                & \multicolumn{1}{c|}{0.93 ± 0.02}         & \multicolumn{1}{c|}{0.92 ± 0.02}        & \multicolumn{1}{c|}{0.91 ± 0.03}         & 0.87 ± 0.04      \\ \hline
\multicolumn{1}{|l|}{WHS++ MRI}        & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}0.88 ± 0.02\end{tabular}}}   & \multicolumn{1}{c|}{0.87 ± 0.03}                & \multicolumn{1}{c|}{0.85 ± 0.03}         & \multicolumn{1}{c|}{0.83 ± 0.04}        & \multicolumn{1}{c|}{0.80 ± 0.04}         & 0.78 ± 0.05      \\ \hline
\multicolumn{1}{|l|}{TBI}              & \multicolumn{1}{c|}{\textbf{0.78 ± 0.03}}                                                                 & \multicolumn{1}{c|}{0.72 ± 0.04}                & \multicolumn{1}{c|}{0.70 ± 0.04}         & \multicolumn{1}{c|}{0.68 ± 0.05}        & \multicolumn{1}{c|}{0.66 ± 0.06}         & 0.63 ± 0.06      \\ \hline
\multicolumn{1}{|l|}{ISLES2024}        & \multicolumn{1}{c|}{\textbf{0.84 ± 0.02}}                                                                 & \multicolumn{1}{c|}{0.80 ± 0.03}                & \multicolumn{1}{c|}{0.79 ± 0.03}         & \multicolumn{1}{c|}{0.76 ± 0.04}        & \multicolumn{1}{c|}{0.74 ± 0.04}         & 0.72 ± 0.05      \\ \hline
\multicolumn{7}{|c|}{\textbf{Average HD (±SD)}}                                                                                                                                                                                                                                                                                         \\ \hline
\multicolumn{1}{|l|}{\textbf{Dataset}} & \multicolumn{1}{c|}{\textbf{3D-SegSync}}                                                                  & \multicolumn{1}{c|}{\textbf{3D-SegSync-Bottom}} & \multicolumn{1}{c|}{\textbf{xLSTM-UNet}} & \multicolumn{1}{c|}{\textbf{3D-nnUNet}} & \multicolumn{1}{c|}{\textbf{3D-ResUNet}} & \textbf{3D-UNet} \\ \hline
\multicolumn{1}{|l|}{HVSMR-2.0}        & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}17.25 ± 2.4\end{tabular}}}           & \multicolumn{1}{c|}{24.21 ± 3.2}                & \multicolumn{1}{c|}{21.19 ± 2.8}         & \multicolumn{1}{c|}{22.16 ± 3.1}        & \multicolumn{1}{c|}{28.87 ± 3.7}         & 33.17 ± 4.0      \\ \hline
\multicolumn{1}{|l|}{MMWHS CT}         & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}14.39 ± 1.2\end{tabular}}}   & \multicolumn{1}{c|}{18.62 ± 2.5}                & \multicolumn{1}{c|}{19.56 ± 2.6}         & \multicolumn{1}{c|}{19.14 ± 2.8}        & \multicolumn{1}{c|}{35.48 ± 4.1}         & 38.25 ± 4.5      \\ \hline
\multicolumn{1}{|l|}{MMWHS MRI}        & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}29.02 ± 3.1 \end{tabular}}} & \multicolumn{1}{c|}{31.76 ± 3.6}                & \multicolumn{1}{c|}{33.41 ± 4.0}         & \multicolumn{1}{c|}{34.12 ± 4.1}        & \multicolumn{1}{c|}{38.62 ± 4.7}         & 40.77 ± 5.1      \\ \hline
\multicolumn{1}{|l|}{WHS++ CT}         & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}5.28 ± 0.8\end{tabular}}}    & \multicolumn{1}{c|}{17.94 ± 2.2}                & \multicolumn{1}{c|}{21.58 ± 2.4}         & \multicolumn{1}{c|}{29.60 ± 3.5}        & \multicolumn{1}{c|}{61.24 ± 5.2}         & 65.11 ± 5.6      \\ \hline
\multicolumn{1}{|l|}{WHS++ MRI}        & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}5.12 ± 0.9 \end{tabular}}}            & \multicolumn{1}{c|}{13.17 ± 1.7}                & \multicolumn{1}{c|}{21.43 ± 2.3}         & \multicolumn{1}{c|}{25.01 ± 2.9}        & \multicolumn{1}{c|}{58.71 ± 5.0}         & 62.88 ± 5.4      \\ \hline
\multicolumn{1}{|l|}{TBI}              & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}19.45 ± 2.6 \end{tabular}}}             & \multicolumn{1}{c|}{23.17 ± 3.1}                & \multicolumn{1}{c|}{24.13 ± 3.4}         & \multicolumn{1}{c|}{27.20 ± 3.9}        & \multicolumn{1}{c|}{33.57 ± 4.3}         & 36.12 ± 4.8      \\ \hline
\multicolumn{1}{|l|}{ISLES2024}        & \multicolumn{1}{c|}{\textbf{\begin{tabular}[c]{@{}c@{}}29.22 ± 3.2  \end{tabular}}}  & \multicolumn{1}{c|}{31.67 ± 3.5}                & \multicolumn{1}{c|}{34.72 ± 3.8}         & \multicolumn{1}{c|}{35.09 ± 4.0}        & \multicolumn{1}{c|}{39.61 ± 4.5}         & 38.87 ± 4.7      \\ \hline
\end{tabular}}
\end{table}

Table \ref{overall} shows that 3D-SegSync outperforms all other models on the heart and brain imaging datasets. 3D-SegSync-Bottom ranks second on every dataset except HVSMR-2.0, where xLSTM-UNet ranks second. Because this variant fine-tunes only the bottom-layer features, the gap indicates that optimizing all encoder layers improves performance. For a comparison with the latest 3D SSL SOTA models (Hi-End-MAE, SwinSSL, SwinMM, and VoCo), Table \ref{3dssl} presents performance scores on the MMWHS (CT) dataset for whole-heart segmentation. 3D-SegSync achieves the highest Dice score (0.94) and the lowest HD (14.39), HD95 (4.197), ASSD (0.942), and Vol Diff (0.0062). Compared with SwinSSL (Dice: 0.909, HD: 19.033) and Hi-End-MAE (Dice: 0.869, HD: 23.212), our model demonstrates higher segmentation accuracy and robustness, highlighting its effectiveness in cardiac image analysis. Further explanation of the results for each dataset can be found in Appendix~B.




\begin{table}[H]
\centering
\caption{Performance comparison of 3D-SegSync with recent SOTA 3D SSL models on the MMWHS (CT) dataset.}
\label{3dssl}
\resizebox{\textwidth}{!}{\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
\textbf{Model}    & \textbf{Dice (±SD)}  & \textbf{HD (±SD)}    & \textbf{HD95 (±SD)}  & \textbf{ASSD (±SD)}  & \textbf{Vol\_Diff (±SD)}  & \textbf{p-value (Dice)} \\ \hline
3D-SegSync        & \textbf{0.94 ± 0.01} & \textbf{14.39 ± 1.2} & \textbf{4.197 ± 0.4} & \textbf{0.942 ± 0.1} & \textbf{0.0062 ± 0.0005} & \textbf{-}              \\ \hline
SwinMM (2023)     & 0.91 ± 0.03          & 17.33 ± 2.8          & 6.762 ± 1.4          & 1.087 ± 0.3          & 0.00785 ± 0.0010          & 0.002*              \\ \hline
SwinSSL (2022)    & 0.90 ± 0.04          & 19.03 ± 3.1          & 8.011 ± 1.7          & 2.089 ± 0.5          & 0.00922 ± 0.0012          & 0.001*              \\ \hline
VoCo (2024)       & 0.92 ± 0.02          & 16.78 ± 2.5          & 6.181 ± 1.3          & 1.111 ± 0.3          & 0.00779 ± 0.0009          & 0.003*              \\ \hline
Hi-End-MAE (2025) & 0.86 ± 0.05          & 23.21 ± 3.5          & 10.81 ± 2.0          & 2.009 ± 0.6          & 0.00982 ± 0.0015          & \textless{}0.001*   \\ \hline
\end{tabular}}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{heart visualization.png}
    \caption{Qualitative performance of the proposed and SOTA models on the MMWHS CT dataset. Colour representation: Purple (AO), Yellow (RA), Red (LV), Light Blue (Myo), Gray (PA), Blue (LA), Green (RV).}
    \label{fig: heart visualization}
\end{figure}


Figure \ref{fig: heart visualization} illustrates the qualitative performance of the proposed 3D-SegSync on the MMWHS CT dataset. The segmentation produced by 3D-SegSync is closely aligned with the ground truth (GT) segmentation map and is more accurate than those of the other SOTA models across most anatomical regions of the whole heart. Among the comparative models, 3D-ResUNet and 3D-UNet show more segmentation errors, especially in the pulmonary veins and aorta. The SOTA 3D-nnUNet performs comparatively better but exhibits noticeable errors in the right atrium, as shown in the 3D segmentation map. These observations point to areas where cardiac segmentation methods could be further refined.

Beyond whole-heart segmentation, we validated the proposed model on neurological/brain imaging datasets, including the TBI dataset (see \figureref{fig:brain_plots}) from the MICCAI (Medical Image Computing and Computer Assisted Intervention) 2024 Grand Challenge. Our model secured first place in the TBI validation and testing phases, demonstrating its accuracy and generalisability. The leaderboard for the TBI challenge can be accessed at \href{https://aims-tbi.grand-challenge.org/evaluation/final-test-phase/leaderboard/}{aims-tbi.grand-challenge.org}, where our team, DeepLearnAI, is listed in the top position. We also tested our model on the ISLES-2024 stroke challenge (see \figureref{fig:brain_plots}), achieving first place on the leaderboard under the team name Dolphins. The leaderboard for the ISLES challenge can be viewed at \href{https://isles-24.grand-challenge.org/evaluation/preliminary-docker-evaluation/leaderboard/}{isles-24.grand-challenge.org}. These results on both the TBI and ISLES-2024 challenges underline the strong performance of the proposed model relative to other SOTA deep learning approaches.

 \begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{brain visulazation.png}
    \caption{Qualitative analysis of the proposed 3D-SegSync model compared to state-of-the-art (SOTA) models for TBI lesion segmentation.}
    \label{fig:brain visulazation}
\end{figure}

Figure \ref{fig:brain visulazation} shows a qualitative analysis of the 3D-SegSync model for TBI lesion segmentation, illustrating its higher accuracy compared to state-of-the-art models. Unlike the baseline xLSTM-UNet model, 3D-SegSync captures the intricate features of moderate to severe TBI (msTBI) lesions and copes with their high variability to produce precise segmentations. Its success across various datasets and modalities highlights its robustness and versatility. By leveraging pretrained SSL features, the model reduces dependence on large labeled datasets, which makes it effective in medical imaging settings with limited annotated data. These results position 3D-SegSync as a reliable solution for medical image segmentation.



Future work could involve applying SSL to larger datasets for improved generalizability, extending 3D-SegSync to other imaging modalities (e.g., ultrasound, PET), and incorporating multi-modal data (e.g., clinical or genomic data) to improve diagnostic accuracy. Incorporating interpretability techniques could further enhance trust in clinical applications. Addressing these areas will help 3D-SegSync evolve into a more powerful tool for medical imaging.


\section{Conclusion} 
We introduced 3D-SegSync, a robust 3D medical image segmentation framework designed to address challenges like data scarcity, modality variability, and anatomical complexity. By combining the DINOv2 teacher-student architecture with the xLSTM-UNet, 3D-SegSync leverages self-supervised learning to extract rich, modality-independent 3D features from large-scale unlabeled datasets. The xLSTM-UNet further enhances the model’s ability to capture spatial and contextual relationships in 3D imaging, making it highly effective for segmentation tasks. This fully 3D framework achieves state-of-the-art performance in whole-heart, stroke lesion, and traumatic brain injury segmentation across CT and MRI, significantly reducing dependence on labeled datasets. By uniting powerful 3D self-supervised learning with efficient design, 3D-SegSync sets a new benchmark in medical imaging, offering improved scalability and clinical relevance. Future work will explore its application to broader tasks, strengthening its cross-modality capabilities and impact.



\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{We would like to express our gratitude to all those who contributed to this work. Their insightful feedback, support, and collaboration were invaluable.}


\bibliography{midl25_77}

\clearpage






















\appendix

\section{Proposed Framework}
\begin{figure}[h]
    \centering
    \includegraphics[width=1\textwidth]{framework.png}
    \caption{3D-SegSync framework architecture: (1) data curation, (2) pretraining using
SSL on unlabelled cardiac and brain image datasets, (3) fine-tuning for cardiac and
brain image segmentation using labeled datasets, and (4) performance analysis of 3D-SegSync
against SOTA models.}
    \label{fig:framework}
\end{figure}


\section{Comparisons of 3D-SegSync with SOTA models}


\figureref{fig:ct_MRI_Comp} demonstrates 3D-SegSync’s ability to generalize across different imaging modalities, outperforming SOTA models in both CT and MRI datasets for whole-heart segmentation. This ability to learn modality-independent features via SSL pre-training ensures its applicability in clinical settings where multimodal imaging is common.

\figureref{fig:sig_heart} presents significance maps, in which 3D-SegSync consistently shows the most yellow regions, indicating statistically significant improvements in Dice and HD95 over the other models. 3D-SegSync\_Bottom shows fewer yellow regions than its more advanced counterpart, 3D-SegSync, reflecting its weaker performance, while 3D-UNet displays the most blue regions, indicating significantly lower performance than the other models on most datasets, such as HVSMR and WHS++. Model rankings across all whole-heart segmentation datasets are given in \figureref{fig:raning_heart}.

\begin{table}[H]
\centering
\caption{Performance analysis of the proposed and state-of-the-art models on the HVSMR-2.0 dataset.}
\label{HVSMR-2.0}
\resizebox{\textwidth}{!}{\begin{tabular}{|l|c|c|c|c|c|}
\hline
\hline
{\color[HTML]{000000} \textbf{Model}}    & {\color[HTML]{000000} \textbf{Dice\_Avg\_All}} & {\color[HTML]{000000} \textbf{HD\_Avg\_All}} & {\color[HTML]{000000} \textbf{HD95\_Avg\_All}} & {\color[HTML]{000000} \textbf{ASSD\_Avg\_All}} & {\color[HTML]{000000} \textbf{Vol\_Diff\_Avg\_All}} \\ \hline
{\color[HTML]{000000} 3D-SegSync}        & {\color[HTML]{000000} 0.77132}                 & {\color[HTML]{000000} 17.25634}              & {\color[HTML]{000000} 8.006959}                & {\color[HTML]{000000} 2.02056}                 & {\color[HTML]{000000} 0.02273048}                   \\ \hline
{\color[HTML]{000000} 3D-SegSync\_Bottom} & {\color[HTML]{000000} 0.753696}                & {\color[HTML]{000000} 24.21755}              & {\color[HTML]{000000} 10.89928}                & {\color[HTML]{000000} 3.03842}                 & {\color[HTML]{000000} 0.024452868}                  \\ \hline
{\color[HTML]{000000} xLSTM-UNET}        & {\color[HTML]{000000} 0.766224}                & {\color[HTML]{000000} 29.22956}              & {\color[HTML]{000000} 11.74866}                & {\color[HTML]{000000} 3.100872}                & {\color[HTML]{000000} 0.026769222}                  \\ \hline
{\color[HTML]{000000} 3D-nnUNet}         & {\color[HTML]{000000} 0.745084}                & {\color[HTML]{000000} 22.12653}              & {\color[HTML]{000000} 10.00376}                & {\color[HTML]{000000} 2.39672}                 & {\color[HTML]{000000} 0.026893855}                  \\ \hline
{\color[HTML]{000000} 3D-ResUNet}        & {\color[HTML]{000000} 0.706659}                & {\color[HTML]{000000} 68.07598}              & {\color[HTML]{000000} 26.91917}                & {\color[HTML]{000000} 7.344615}                & {\color[HTML]{000000} 0.037925875}                  \\ \hline
{\color[HTML]{000000} 3D-UNet}           & {\color[HTML]{000000} 0.670413}                & {\color[HTML]{000000} 35.08616}              & {\color[HTML]{000000} 16.39909}                & {\color[HTML]{000000} 4.138591}                & {\color[HTML]{000000} 0.037455937}                  \\ \hline
\end{tabular}}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{sig_heart.png}
    \caption{Significance maps of the proposed 3D-SegSync and SOTA models for the Dice (a) and HD95 (b) metrics across whole-heart segmentation datasets.}
    \label{fig:sig_heart}
\end{figure}
\begin{figure}[h]
    \centering
    \includegraphics[width=1\textwidth]{ct_MRI_Comp.png}
    \caption{Cross-modality performance comparison of 3D-SegSync and SOTA models for whole-heart segmentation across CT and MRI datasets. (a) WHS++ dataset, where green bars show CT and mustard bars show MRI. (b) MMWHS dataset, where orange bars show CT and blue bars show MRI.}
    \label{fig:ct_MRI_Comp}
\end{figure}



 
Table \ref{HVSMR-2.0} reports the segmentation results on the HVSMR-2.0 whole-heart MRI dataset. 3D-SegSync achieved the best overall performance, with the highest Dice score (0.7713), the lowest HD95 (8.0070 mm), the lowest ASSD (2.0206 mm), and the smallest volume difference (0.0227), indicating accurate overlap, surface alignment, and volume estimation. 3D-SegSync\_Bottom and xLSTM-UNET also performed well, with Dice scores of 0.7537 and 0.7662, respectively, though both exhibited higher HD95 (10.8993 mm and 11.7487 mm) and ASSD values, reflecting less precise boundary alignment. While 3D-nnUNet showed good surface alignment (ASSD of 2.3967 mm), its lower Dice score (0.7451) and higher HD95 (10.0038 mm) relative to 3D-SegSync suggest moderate segmentation accuracy. In contrast, 3D-ResUNet and 3D-UNet underperformed, with markedly lower Dice scores (0.7067 and 0.6704) and much higher HD95 (26.9192 mm and 16.3991 mm), indicating poor boundary and surface alignment. Overall, 3D-SegSync is the most reliable model for whole-heart segmentation on this dataset.
\begin{figure}[h]
    \centering
    \includegraphics[width=1\textwidth]{raning_heart.png}
    \caption{Blob plots illustrating the stability of model rankings on the whole-heart segmentation datasets under bootstrap sampling. The median rank of each algorithm is marked by a black cross, and the 95\% bootstrap intervals across samples are shown as black lines.}
    \label{fig:raning_heart}
\end{figure}

Table \ref{MMWHS CT} reports the segmentation results on the MMWHS CT dataset. 3D-SegSync achieved the best overall performance, with the highest Dice score (0.9415), the lowest ASSD (0.9425 mm), and the smallest volume difference (0.0062), indicating excellent overlap, surface alignment, and volume estimation. 3D-SegSync\_Bottom also performed well, with a Dice score of 0.9316 and the lowest HD95 (3.9579 mm), though it showed a slightly higher ASSD (0.9644 mm). xLSTM-UNET attained a Dice score of 0.9277 but exhibited higher HD95 (5.0222 mm) and ASSD (1.0150 mm), reflecting less precise boundary alignment. 3D-nnUNet demonstrated moderate performance, with a Dice score of 0.9175 and higher HD95 (7.1660 mm) and ASSD (1.3304 mm). In contrast, 3D-ResUNet and 3D-UNet underperformed, with lower Dice scores (0.8916 and 0.8864) and considerably higher HD95 (7.2059 mm and 8.8210 mm) and ASSD (1.6663 mm and 2.0362 mm), indicating poorer boundary and surface alignment. Overall, 3D-SegSync is the most effective model for whole-heart segmentation on this dataset.

\begin{table}[H]
\centering
\caption{Performance analysis of proposed and SOTA models using MMWHS CT dataset.}
\label{MMWHS CT}
\resizebox{\textwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|}
\hline
{\color[HTML]{000000} \textbf{Model}}   & {\color[HTML]{000000} \textbf{Dice\_Avg}} & {\color[HTML]{000000} \textbf{HD\_Avg}} & {\color[HTML]{000000} \textbf{HD95\_Avg}} & {\color[HTML]{000000} \textbf{ASSD\_Avg}} & {\color[HTML]{000000} \textbf{Vol\_Diff\_Avg}} \\ \hline
{\color[HTML]{000000} 3D-SegSync}       & {\color[HTML]{000000} 0.941531281}        & {\color[HTML]{000000} 14.3973592}       & {\color[HTML]{000000} 4.197811849}        & {\color[HTML]{000000} 0.942519917}        & {\color[HTML]{000000} 0.006214206}             \\ \hline
{\color[HTML]{000000} 3D-SegSync\_Bottom} & {\color[HTML]{000000} 0.93157309}         & {\color[HTML]{000000} 18.8680027}       & {\color[HTML]{000000} 3.95788033}         & {\color[HTML]{000000} 0.964365257}        & {\color[HTML]{000000} 0.006409229}             \\ \hline
{\color[HTML]{000000} xLSTM-UNET}       & {\color[HTML]{000000} 0.927654749}        & {\color[HTML]{000000} 17.6516858}       & {\color[HTML]{000000} 5.022211073}        & {\color[HTML]{000000} 1.014969103}        & {\color[HTML]{000000} 0.006599423}             \\ \hline
{\color[HTML]{000000} 3D-nnUNet}        & {\color[HTML]{000000} 0.917473901}        & {\color[HTML]{000000} 19.6456178}       & {\color[HTML]{000000} 7.166018194}        & {\color[HTML]{000000} 1.330353153}        & {\color[HTML]{000000} 0.006746526}             \\ \hline
{\color[HTML]{000000} 3D-ResUNet}       & {\color[HTML]{000000} 0.891598521}        & {\color[HTML]{000000} 35.883997}        & {\color[HTML]{000000} 7.205884387}        & {\color[HTML]{000000} 1.666301144}        & {\color[HTML]{000000} 0.006330065}             \\ \hline
{\color[HTML]{000000} 3D-UNet}          & {\color[HTML]{000000} 0.886358506}        & {\color[HTML]{000000} 58.1659145}       & {\color[HTML]{000000} 8.820965598}        & {\color[HTML]{000000} 2.036248398}        & {\color[HTML]{000000} 0.008241135}             \\ \hline
\end{tabular}}
\end{table}


Tables \ref{MMWHS MRI}, \ref{WHS++ CT}, and \ref{WHS++ MRI} present further CT and MRI whole-heart segmentation results. The proposed 3D-SegSync model, which employs self-supervised pre-training, achieved the highest Dice score in all three tables and the best or near-best values on the remaining metrics, supporting its effectiveness in overlap, boundary alignment, and volume estimation for both CT and MRI. Similar trends on the WHS++ dataset reinforce its generalisability across datasets.

Furthermore, 3D-SegSync was compared against SOTA models on stroke lesion segmentation (ISLES2024) and traumatic brain injury (TBI) lesion segmentation. In both evaluations (Tables \ref{TBI MRI} and \ref{ISLES2024}), 3D-SegSync achieved the highest Dice score (0.7825 on TBI and 0.8483 on ISLES2024, compared with 0.7247 and 0.8049 for the next-best model, 3D-SegSync\_Bottom), showing that the approach extends beyond cardiac anatomy to pathological brain structures. These results highlight the potential of the self-supervised 3D-SegSync model for medical image segmentation across multiple domains.
 
\begin{table}[H]
\centering
\caption{Performance analysis of proposed and SOTA models using MMWHS MRI dataset.}
\label{MMWHS MRI}
\resizebox{\textwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|}
\hline
{\color[HTML]{000000} \textbf{Model}}    & {\color[HTML]{000000} \textbf{Dice\_Avg}} & {\color[HTML]{000000} \textbf{HD\_Avg}} & {\color[HTML]{000000} \textbf{HD95\_Avg}} & {\color[HTML]{000000} \textbf{ASSD\_Avg}} & {\color[HTML]{000000} \textbf{Vol\_Diff\_Avg}} \\ \hline
{\color[HTML]{000000} 3D-SegSync}        & {\color[HTML]{000000} 0.87167}            & {\color[HTML]{000000} 29.02241}         & {\color[HTML]{000000} 6.871293}           & {\color[HTML]{000000} 1.831784}           & {\color[HTML]{000000} 0.008656418}             \\ \hline
{\color[HTML]{000000} 3D-SegSync\_Bottom} & {\color[HTML]{000000} 0.86413}            & {\color[HTML]{000000} 46.32272}         & {\color[HTML]{000000} 6.813768}           & {\color[HTML]{000000} 2.231868}           & {\color[HTML]{000000} 0.009208905}             \\ \hline
{\color[HTML]{000000} xLSTM-UNET}        & {\color[HTML]{000000} 0.86338}            & {\color[HTML]{000000} 51.50929}         & {\color[HTML]{000000} 7.029732}           & {\color[HTML]{000000} 2.266159}           & {\color[HTML]{000000} 0.008637976}             \\ \hline
{\color[HTML]{000000} 3D-nnUNet}         & {\color[HTML]{000000} 0.85904}            & {\color[HTML]{000000} 23.19734}         & {\color[HTML]{000000} 7.03841}            & {\color[HTML]{000000} 1.946138}           & {\color[HTML]{000000} 0.010388217}             \\ \hline
{\color[HTML]{000000} 3D-ResUNet}        & {\color[HTML]{000000} 0.84187}            & {\color[HTML]{000000} 42.48808}         & {\color[HTML]{000000} 7.26451}            & {\color[HTML]{000000} 2.05212}            & {\color[HTML]{000000} 0.00868741}              \\ \hline
{\color[HTML]{000000} 3D-UNet}           & {\color[HTML]{000000} 0.83663}            & {\color[HTML]{000000} 86.60793}         & {\color[HTML]{000000} 17.44347}           & {\color[HTML]{000000} 3.115857}           & {\color[HTML]{000000} 0.009206529}             \\ \hline
\end{tabular}}
\end{table}


\begin{table}[H]
\centering
\caption{Performance analysis of proposed and SOTA models using WHS++ CT dataset.}
\label{WHS++ CT}
\resizebox{\textwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|}
\hline
{\color[HTML]{000000} \textbf{Model}}    & {\color[HTML]{000000} \textbf{Dice\_Avg}} & {\color[HTML]{000000} \textbf{HD\_Avg}} & {\color[HTML]{000000} \textbf{HD95\_Avg}} & {\color[HTML]{000000} \textbf{ASSD\_Avg}} & {\color[HTML]{000000} \textbf{Vol\_Diff\_Avg}} \\ \hline
{\color[HTML]{000000} 3D-SegSync}        & {\color[HTML]{000000} 0.976996}           & {\color[HTML]{000000} 5.281255}         & {\color[HTML]{000000} 1.251489}           & {\color[HTML]{000000} 0.301377}           & {\color[HTML]{000000} 0.001588615}             \\ \hline
{\color[HTML]{000000} 3D-SegSync\_Bottom} & {\color[HTML]{000000} 0.941146}           & {\color[HTML]{000000} 17.94436}         & {\color[HTML]{000000} 5.572338}           & {\color[HTML]{000000} 1.184542}           & {\color[HTML]{000000} 0.003344029}             \\ \hline
{\color[HTML]{000000} xLSTM-UNET}        & {\color[HTML]{000000} 0.935651}           & {\color[HTML]{000000} 21.58496}         & {\color[HTML]{000000} 8.052558}           & {\color[HTML]{000000} 1.677745}           & {\color[HTML]{000000} 0.003928403}             \\ \hline
{\color[HTML]{000000} 3D-nnUNet}         & {\color[HTML]{000000} 0.927362}           & {\color[HTML]{000000} 29.60166}         & {\color[HTML]{000000} 6.834203}           & {\color[HTML]{000000} 1.301429}           & {\color[HTML]{000000} 0.003741077}             \\ \hline
{\color[HTML]{000000} 3D-ResUNet}        & {\color[HTML]{000000} 0.915044}           & {\color[HTML]{000000} 64.24364}         & {\color[HTML]{000000} 13.20491}           & {\color[HTML]{000000} 2.291552}           & {\color[HTML]{000000} 0.004881045}             \\ \hline
{\color[HTML]{000000} 3D-UNet}           & {\color[HTML]{000000} 0.872164}           & {\color[HTML]{000000} 80.54156}         & {\color[HTML]{000000} 26.13814}           & {\color[HTML]{000000} 5.87906}            & {\color[HTML]{000000} 0.00934523}              \\ \hline
\end{tabular}}
\end{table}

\begin{table}[H]
\centering
\caption{Performance analysis of proposed and SOTA models using WHS++ MRI dataset.}
\label{WHS++ MRI}
\resizebox{\textwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|}
\hline
{\color[HTML]{000000} \textbf{Model}}    & {\color[HTML]{000000} \textbf{Dice\_Avg}} & {\color[HTML]{000000} \textbf{HD\_Avg}} & {\color[HTML]{000000} \textbf{HD95\_Avg}} & {\color[HTML]{000000} \textbf{ASSD\_Avg}} & {\color[HTML]{000000} \textbf{Vol\_Diff\_Avg}} \\ \hline
{\color[HTML]{000000} 3D-SegSync}        & {\color[HTML]{000000} 0.887617}           & {\color[HTML]{000000} 12.19138}         & {\color[HTML]{000000} 5.696797}           & {\color[HTML]{000000} 1.577587}           & {\color[HTML]{000000} 0.006581231}             \\ \hline
{\color[HTML]{000000} 3D-SegSync\_Bottom} & {\color[HTML]{000000} 0.886642}           & {\color[HTML]{000000} 13.17895}         & {\color[HTML]{000000} 5.460724}           & {\color[HTML]{000000} 1.625685}           & {\color[HTML]{000000} 0.006799575}             \\ \hline
{\color[HTML]{000000} xLSTM-UNET}        & {\color[HTML]{000000} 0.87293}            & {\color[HTML]{000000} 16.08778}         & {\color[HTML]{000000} 7.373174}           & {\color[HTML]{000000} 1.892255}           & {\color[HTML]{000000} 0.007296956}             \\ \hline
{\color[HTML]{000000} 3D-nnUNet}         & {\color[HTML]{000000} 0.869232}           & {\color[HTML]{000000} 16.02272}         & {\color[HTML]{000000} 7.136496}           & {\color[HTML]{000000} 1.927073}           & {\color[HTML]{000000} 0.007132139}             \\ \hline
{\color[HTML]{000000} 3D-ResUNet}        & {\color[HTML]{000000} 0.851003}           & {\color[HTML]{000000} 20.14479}         & {\color[HTML]{000000} 6.741368}           & {\color[HTML]{000000} 1.969676}           & {\color[HTML]{000000} 0.007659463}             \\ \hline
{\color[HTML]{000000} 3D-UNet}           & {\color[HTML]{000000} 0.858753}           & {\color[HTML]{000000} 16.35387}         & {\color[HTML]{000000} 7.723141}           & {\color[HTML]{000000} 2.168973}           & {\color[HTML]{000000} 0.008124995}             \\ \hline
\end{tabular}}
\end{table}

\begin{table}[H]
\centering
\caption{Performance analysis of proposed and SOTA models using TBI MRI dataset.}
\label{TBI MRI}
\resizebox{\textwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|}
\hline
{\color[HTML]{000000} \textbf{Model}}    & {\color[HTML]{000000} \textbf{Dice\_Avg}} & {\color[HTML]{000000} \textbf{HD\_Avg}} & {\color[HTML]{000000} \textbf{HD95\_Avg}} & {\color[HTML]{000000} \textbf{ASSD\_Avg}} & {\color[HTML]{000000} \textbf{Vol\_Diff\_Avg}} \\ \hline
{\color[HTML]{000000} 3D-SegSync}        & {\color[HTML]{000000} 0.782532}           & {\color[HTML]{000000} 19.45671}         & {\color[HTML]{000000} 15.36666}           & {\color[HTML]{000000} 2.45671}            & {\color[HTML]{000000} 0.275341}                \\ \hline
{\color[HTML]{000000} 3D-SegSync\_Bottom} & {\color[HTML]{000000} 0.724659}           & {\color[HTML]{000000} 23.17893}         & {\color[HTML]{000000} 17.99199}           & {\color[HTML]{000000} 3.01345}            & {\color[HTML]{000000} 0.492328}                \\ \hline
{\color[HTML]{000000} xLSTM-UNET}        & {\color[HTML]{000000} 0.686121}           & {\color[HTML]{000000} 24.13334}         & {\color[HTML]{000000} 19.92581}           & {\color[HTML]{000000} 3.18965}            & {\color[HTML]{000000} 0.718707}                \\ \hline
{\color[HTML]{000000} 3D-nnUNet}         & {\color[HTML]{000000} 0.678905}           & {\color[HTML]{000000} 25.31134}         & {\color[HTML]{000000} 17.84589}           & {\color[HTML]{000000} 3.30567}            & {\color[HTML]{000000} 0.739339}                \\ \hline
{\color[HTML]{000000} 3D-ResUNet}        & {\color[HTML]{000000} 0.678905}           & {\color[HTML]{000000} 27.21234}         & {\color[HTML]{000000} 17.84589}           & {\color[HTML]{000000} 4.17865}            & {\color[HTML]{000000} 0.786737}                \\ \hline
{\color[HTML]{000000} 3D-UNet}           & {\color[HTML]{000000} 0.643323}           & {\color[HTML]{000000} 28.18971}         & {\color[HTML]{000000} 18.62511}           & {\color[HTML]{000000} 6.34567}            & {\color[HTML]{000000} 0.852835}                \\ \hline
\end{tabular}}
\end{table}

\begin{table}[H]
\centering
\caption{Performance analysis of proposed and SOTA models using ISLES2024 dataset.}
\label{ISLES2024}
\resizebox{\textwidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|}
\hline
{\color[HTML]{000000} \textbf{Model}}    & {\color[HTML]{000000} \textbf{Dice\_Avg}} & {\color[HTML]{000000} \textbf{HD\_Avg}} & {\color[HTML]{000000} \textbf{HD95\_Avg}} & {\color[HTML]{000000} \textbf{ASSD\_Avg}} & {\color[HTML]{000000} \textbf{Vol\_Diff\_Avg}} \\ \hline
{\color[HTML]{000000} 3D-SegSync}        & {\color[HTML]{000000} 0.848296}           & {\color[HTML]{000000} 29.22114}         & {\color[HTML]{000000} 21.05371}           & {\color[HTML]{000000} 2.78653}            & {\color[HTML]{000000} 0.840097}                \\ \hline
{\color[HTML]{000000} 3D-SegSync\_Bottom} & {\color[HTML]{000000} 0.804858}           & {\color[HTML]{000000} 31.67891}         & {\color[HTML]{000000} 20.2839}            & {\color[HTML]{000000} 1.89765}            & {\color[HTML]{000000} 0.840068}                \\ \hline
{\color[HTML]{000000} xLSTM-UNET}        & {\color[HTML]{000000} 0.798529}           & {\color[HTML]{000000} 34.72802}         & {\color[HTML]{000000} 25.56568}           & {\color[HTML]{000000} 2.93112}            & {\color[HTML]{000000} 0.865685}                \\ \hline
{\color[HTML]{000000} 3D-nnUNet}         & {\color[HTML]{000000} 0.769919}           & {\color[HTML]{000000} 35.09987}         & {\color[HTML]{000000} 27.30891}           & {\color[HTML]{000000} 2.98765}            & {\color[HTML]{000000} 0.887896}                \\ \hline
{\color[HTML]{000000} 3D-ResUNet}        & {\color[HTML]{000000} 0.74243459}         & {\color[HTML]{000000} 39.61133}         & {\color[HTML]{000000} 30.11339}           & {\color[HTML]{000000} 4.23478}            & {\color[HTML]{000000} 0.900569}                \\ \hline
{\color[HTML]{000000} 3D-UNet}           & {\color[HTML]{000000} 0.702374}           & {\color[HTML]{000000} 38.87633}         & {\color[HTML]{000000} 30.9896}            & {\color[HTML]{000000} 6.13459}            & {\color[HTML]{000000} 0.946751}                \\ \hline
\end{tabular}}
\end{table}


\section{Methodology and Mathematical Details of the Proposed Framework}

\subsection{Methodology}
The proposed framework is built on a self-supervised learning (SSL) approach \cite{mazher2024self} designed to pre-train a 3D Vision-LSTM (xLSTM) integrated UNet model (xLSTM-UNet) \cite{oquab2023dinov2, chen2024xlstm}. The methodology combines advanced deep learning techniques to achieve enhanced performance in 3D medical image segmentation tasks. The main diagram of the proposed SSL model is shown in Figure~\ref{fig:framework} in Appendix~A.
	
\subsubsection{Data Augmentation in the Student-Teacher Framework}

Robust data augmentation plays a critical role in the SSL pipeline. Techniques such as flipping, scaling, Gaussian noise addition, Gaussian blur, and adjustments to brightness and contrast are applied to create diverse and informative training inputs. Two augmented views of each input image are generated and processed through a Siamese network structure, comprising the student and teacher encoders. The teacher encoder’s outputs are refined through centring, sharpening, and normalisation via a softmax function, producing supervision signals for the student encoder.

The loss function ensures alignment between the student’s outputs and the teacher’s processed outputs by minimising divergence, employing cross-entropy loss and mean squared error (MSE) \cite{oquab2023dinov2}. This alignment facilitates robust feature learning from unlabelled data, enhancing the model’s generalisation capabilities.
	
\subsubsection{xLSTM-UNet Architecture}

The xLSTM-UNet model \cite{chen2024xlstm} integrates Vision-LSTM (xLSTM), an advanced extension of Long Short-Term Memory (LSTM) networks, into the UNet architecture. xLSTM excels at capturing long-range dependencies and contextual information, complementing the UNet’s strength in extracting local features through its convolutional encoder-decoder design. The encoder identifies hierarchical features from the input, while the decoder reconstructs these features into detailed segmentation maps, enabling precise and reliable segmentation.
	
\subsubsection{Self-Supervised Pre-Training and Supervised Fine-Tuning}

The SSL framework focuses on pre-training the xLSTM-UNet encoder using unlabelled data to capture meaningful spatial and contextual features. Once pre-trained, the encoder is fine-tuned in a supervised manner using labelled datasets, optimising the decoder to generate accurate segmentation maps. This two-stage process minimises the reliance on extensive labelled datasets, while the xLSTM module ensures effective learning of global context and long-range dependencies.


\subsection{Mathematical Details of the Proposed Framework}
The momentum teacher encoder’s parameters $\theta_t$ are updated based on the student encoder’s parameters $\theta_s$ using a momentum-based approach:


\begin{equation}
\theta_t = m \cdot \theta_t + (1-m) \cdot \theta_s
\end{equation}

Here, $\theta_t$ denotes the parameters of the teacher encoder, $\theta_s$ the parameters of the student encoder, and $m$ the momentum coefficient, typically set close to 1.
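
As an illustrative example (the numbers are assumed for illustration only and are not taken from the experiments), consider a single scalar parameter with $m = 0.99$, a current teacher value $\theta_t = 1.0$, and a student value $\theta_s = 0.0$. One update gives

\begin{equation}
\theta_t \leftarrow 0.99 \cdot 1.0 + (1 - 0.99) \cdot 0.0 = 0.99,
\end{equation}

so the teacher moves only 1\% of the way towards the student in a single step. If the student were held fixed, the gap between teacher and student would shrink by a factor of $m^n$ after $n$ updates (for example, $0.99^{100} \approx 0.366$), which is why the teacher behaves as a slowly varying average of past student weights.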

Let $x$ be the original input image. Two different views of the input, $x_1$ and $x_2$, are generated using strong data augmentations:

\begin{equation}
x_1 = \text{Augment}(x), \quad x_2 = \text{Augment}(x)
\end{equation}

Both views are then processed through the student encoder $f_s$ and teacher encoder $f_t$ to extract feature representations:

\begin{equation}
h_1 = f_s(x_1; \theta_s), \quad h_2 = f_s(x_2; \theta_s)
\end{equation}

\begin{equation}
h_1' = f_t(x_1; \theta_t), \quad h_2' = f_t(x_2; \theta_t)
\end{equation}



Here, $h_1$ and $h_2$ are the feature representations from the student encoder, and $h_1'$ and $h_2'$ are those from the teacher encoder.

The feature representations $h_1$, $h_2$, $h_1'$, and $h_2'$ are then reduced to feature vectors by global average pooling (GAP):

\begin{equation}
v_1 = \text{GAP}(h_1), \quad v_2 = \text{GAP}(h_2)
\end{equation}

\begin{equation}
v_1' = \text{GAP}(h_1'), \quad v_2' = \text{GAP}(h_2')
\end{equation}

Here, $v_1$, $v_2$, $v_1'$, and $v_2'$ are the resulting feature vectors. Each vector is then passed through an MLP projection head:

\begin{equation}
z_1 = \text{MLP}(v_1), \quad z_2 = \text{MLP}(v_2)
\end{equation}

\begin{equation}
z_1' = \text{MLP}(v_1'), \quad z_2' = \text{MLP}(v_2')
\end{equation}

After projection, the teacher’s output is centered, sharpened, and passed through a softmax function to produce the supervision signal:

\begin{equation}
q_1' = \text{Softmax}\left(\frac{\text{Center}(z_1')}{\tau}\right)
\end{equation}

\begin{equation}
q_2' = \text{Softmax}\left(\frac{\text{Center}(z_2')}{\tau}\right)
\end{equation}


Here, $\text{Center}(z)$ subtracts the mean of the vector so that it has zero mean, $\tau$ is the temperature parameter controlling the sharpness of the distribution, and $\text{Softmax}(z)$ normalizes the vector into a probability distribution.
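
Written out component-wise under the definitions above, the teacher signal takes the following form. The symbols $K$ (the dimension of the projected vector) and $\bar{z}' = \frac{1}{K}\sum_{j=1}^{K} z'[j]$ (its mean) are notation introduced here only for illustration:

\begin{equation}
q'[k] = \frac{\exp\left((z'[k] - \bar{z}')/\tau\right)}{\sum_{j=1}^{K} \exp\left((z'[j] - \bar{z}')/\tau\right)}, \quad k = 1, \dots, K.
\end{equation}

As a small assumed example, take $K = 2$ and $z' = (1, 0)$, so that the centered vector is $(0.5, -0.5)$. With $\tau = 1$ the first component is $q'[1] = 1/(1 + e^{-1}) \approx 0.731$, whereas with $\tau = 0.1$ it becomes $q'[1] = 1/(1 + e^{-10}) \approx 0.99995$. A lower temperature therefore concentrates the distribution on the largest component, which is the sharpening effect referred to above.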

The loss function is designed to minimize the divergence between the student’s feature vectors and the teacher’s processed outputs. A common choice is the cross-entropy loss or mean squared error (MSE) between the student's and teacher's outputs:

\begin{equation}
L = \frac{1}{2} \left( \text{Loss}(z_1, q_2') + \text{Loss}(z_2, q_1') \right)
\end{equation}

This loss function encourages the student encoder to produce feature representations that align closely with the teacher's outputs, thus enabling effective learning from the unlabeled data.

For the downstream segmentation training task, we used the cross-entropy loss given in the equation below:

\begin{equation}
L(z, q') = - \sum_{k=1}^K q'[k] \log(\text{Softmax}(z)[k])
\end{equation}
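
To make the link between the two loss expressions concrete, the following sketch assumes that the generic $\text{Loss}$ in the symmetric objective is instantiated with this cross-entropy form (one of the two options mentioned above, not necessarily the exact configuration used in the experiments). Under that assumption the objective expands to

\begin{equation}
L = -\frac{1}{2} \sum_{k=1}^{K} \left( q_2'[k] \log(\text{Softmax}(z_1)[k]) + q_1'[k] \log(\text{Softmax}(z_2)[k]) \right),
\end{equation}

so that each student view is matched against the teacher target computed from the other view. As an assumed numerical example with $K = 2$, a teacher target $q' = (0.9, 0.1)$ and a student distribution $\text{Softmax}(z) = (0.8, 0.2)$ give $L(z, q') = -(0.9 \ln 0.8 + 0.1 \ln 0.2) \approx 0.362$ using the natural logarithm. The value decreases towards the entropy of $q'$ as the student distribution approaches the teacher target.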
\end{document}
