\documentclass{midl} % Include author names
% \documentclass[anon]{midl} % Anonymized submission

% For tables
\usepackage{float}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{adjustbox}
\usepackage{transparent}
\usepackage[hypcap=false]{caption}
\usepackage[accsupp]{axessibility} % Improves PDF readability for those with disabilities.

% \definecolor{rebuttalcolor}{RGB}{0,165,135}
% \newcommand*{\rebuttal}[1]{\textcolor{rebuttalcolor}{#1}}
% \newcommand*{\todo}[1]{\textcolor{red}{#1}}
\newcommand*{\method}[1]{UnCLe SAM}

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024}
\jmlrvolume{-- 028}
\editors{Accepted for publication at MIDL 2024}

% \usepackage{mwe} % to get dummy images
% \jmlrvolume{-- Under Review}
% \jmlryear{2024}
% \jmlrworkshop{Full Paper -- MIDL 2024 submission}
% \editors{Under Review for MIDL 2024}


\title[\method{}]{\method{}: Unleashing SAM’s Potential for Continual Prostate MRI Segmentation}

\midlauthor{\Name{Amin Ranem}\nametag{$^{1}$} \Email{amin.ranem@gris.informatik.tu-darmstadt.de}\\
\Name{Aflam Ahlal}\nametag{$^{1}$}
\Email{afham.mohamed@stud.tu-darmstadt.de}\\
\Name{Moritz Fuchs}\nametag{$^{1}$}
\Email{moritz.fuchs@gris.informatik.tu-darmstadt.de}\\
\Name{Anirban Mukhopadhyay}\nametag{$^{1}$} \Email{anirban.mukhopadhyay@gris.informatik.tu-darmstadt.de}\\
\addr $^{1}$ Technical University of Darmstadt, Karolinenpl. 5, 64289 Darmstadt, Germany\\}


 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
% \midlauthor{\Name{Author Name1\midljointauthortext{Contributed equally}\nametag{$^{1,2}$}} \Email{abc@sample.edu}\\
% \addr $^{1}$ Address 1 \\
% \addr $^{2}$ Address 2 \AND
% \Name{Author Name2\midlotherjointauthor\nametag{$^{1}$}} \Email{xyz@sample.edu}\\
% \Name{Author Name3\nametag{$^{2}$}} \Email{alphabeta@example.edu}\\
% \Name{Author Name4\midljointauthortext{Contributed equally}\nametag{$^{3}$}} \Email{uvw@foo.ac.uk}\\
% \addr $^{3}$ Address 3 \AND
% \Name{Author Name5\midlotherjointauthor\nametag{$^{4}$}} \Email{fgh@bar.com}\\
% \addr $^{4}$ Address 4
% }

\begin{document}

\maketitle

\begin{abstract}
Research on continual medical image segmentation has relied primarily on U-Net and its derivatives, which struggle to meet the demands of domains that shift over time. Foundation models serve as robust knowledge repositories, offering unique advantages such as general applicability, knowledge transferability, and continuous improvements. Leveraging this pre-existing knowledge can enhance adaptability, generalization, and performance across diverse tasks.
In this work, we show how to deploy the natural-image pretraining of the Segment Anything Model (SAM) for continual medical image segmentation, where data is sparse.
We introduce \textbf{\method{}}, a novel approach that uses the knowledge of the pre-trained SAM foundation model to make it suitable for continual segmentation in dynamic environments.
We demonstrate that \method{} is a robust alternative to U-Net-based approaches and showcase its state-of-the-art (SOTA) continual medical segmentation capabilities.
The primary objective of \method{} is to strike a balance between model rigidity and plasticity, effectively addressing prevalent pitfalls of continual learning (CL) methods.
We assess \method{} through a series of prostate segmentation tasks, applying a set of different CL methods. Comparative evaluations against the Lifelong nnU-Net framework reveal the potential application of \method{} in dynamically changing environments like healthcare.
Our code base is available at \url{https://github.com/MECLabTUDA/UnCLeSAM/}.
\end{abstract}

\begin{keywords}
Continual learning, Foundation Model, Segment Anything Model
\end{keywords}

\section{Introduction}
%% Why this paper ?
Continual learning (CL) holds immense significance in \textit{safety-critical applications} of Deep Learning. This is evident in healthcare, where models must adapt to data changes over time while maintaining high performance on older data \cite{gonzalez2020wrong}. 
Traditional \textit{U-Net architectures encounter difficulties} in seamlessly adapting to shifts in data distribution, particularly when faced with new imaging protocols or variations in patient populations or diseases \cite{sanner2021reliable, derakhshani2022lifelonger, gonzalez2022task, pmlr-v172-fuchs22a}. The challenge lies in training models that perform well even though each dataset is only available for a limited time. Finding a good trade-off between rigidity, which hinders learning new tasks, and plasticity, which causes catastrophic forgetting on previous tasks, is therefore important \cite{kirkpatrick2017overcoming, hadsell2020embracing, de2021continual}.
Existing CL methods, when applied to medical data, often result in segmentations that fall short of basic semantic standards like semantic coherence over time \cite{ranem2022continual, gonzalez2023lifelong}.

\begin{figure}[htp]
    \centering
    \includegraphics[trim=0 6cm 3cm 0, clip, width=0.95\textwidth]{images/intro.pdf}
    \caption{Unlike traditional static training (left), continual U-Net training (middle) involves time-limited access to training data. Data arrives sequentially, and the model lacks access to previous data. In contrast, \method{} (right) continuously adapts the adapter with sequentially arriving data while benefiting from SAM's pre-trained knowledge base.}
    \label{fig:intro}
\end{figure}

%% How different from SOTA ?
We introduce \textbf{\method{}} (Unleashing Continual Learning for SAM), a novel approach that leverages the Segment Anything Model (SAM) \cite{kirillov2023segment} for \textit{enhanced domain adaptation} in continual medical setups. Our method continually adapts the prompt for SAM while leveraging SAM's knowledge base, without the full re-training used in MedSAM \cite{ma2023segment}.

The Lifelong nnU-Net framework and other CL methods, such as Elastic Weight Consolidation (EWC) \cite{kirkpatrick2017overcoming}, Riemannian Walk (RWalk) \cite{chaudhry2018riemannian}, or basic replay methods like iCaRL \cite{rebuffi2017icarl}, struggle to adequately adapt to changing domains \cite{gonzalez2023lifelong}. Replay, regularization, and knowledge distillation each have advantages and disadvantages under domain shift in continual setups, for instance an unfavorable rigidity/plasticity trade-off or an added computational burden.
% However, the challenges that come with domain shifts can not solely be solved by CL methods, but rather getting away from the traditional U-Net-like architectures helps better to adapt according to the changing domain. 
Combining CL with foundation models has so far remained unexplored; Figure \ref{fig:intro} contrasts this setting with static and continual U-Net training.

\textit{\method{} strategically addresses challenges for domain adaptation} known when applying U-Net-based architectures. U-Nets encounter challenges in maintaining segmentation accuracy amidst variations in imaging protocols or discrepancies in patient populations, resulting in compromised performance and reduced reliability of the model in dynamic clinical scenarios \cite{gonzalez2020wrong,ranem2022continual,sanner2021reliable}. Leveraging the robust knowledge from SAM while continually adapting the prompting adapter reduces such challenges.
% Additionally, when applying CL methods like EWC, RWalk, or rehearsal \cite{kirkpatrick2017overcoming, rebuffi2017icarl, chaudhry2018riemannian}, it has been observed that the models tend to be too rigid, i.e., not able to acquire new knowledge.

In this work, we use the pre-trained SAM architecture for continual prostate MRI segmentation to \textit{leverage the foundation model's knowledge to properly adapt to shifting domains}. Rather than applying regularization to the network, we freeze certain architectural components such as the Vision Transformer, i.e., SAM's encoder, while \textit{continually adapting the prompt for SAM}. With this approach, the learned visual representation can be transferred to different domains \cite{ma2023segment}.
% This is because the learned visual representation within such components does not necessarily need to change and can be applied to other domains. Training large Vision Transformer components from scratch requires a substantial amount of data, which is often sparse in the medical domain. In contrast, we show by transferring and leveraging existing knowledge, we can effectively fulfill this task.

%% What exactly are we doing ?
% \method{} leverages the knowledge of the pre-trained SAM foundation model to overcome domain shifts in dynamic environments by providing a robust knowledge base. 
SAM's ability to segment anything in diverse contexts becomes a valuable asset for continual adaptation, ensuring that the model maintains high segmentation performance across evolving datasets with simple fine-tuning techniques. By using a pre-trained ResNet-50 network \cite{he2016deep} as an Adapter that continually updates the SAM prompt, \method{} effectively handles the domain adaptation challenges commonly faced by U-Net-based architectures. \textit{\method{} does not require a long time to train} on a new domain, which makes it more practical to apply while achieving SOTA performance. To validate our approach, we focus on the critical task of prostate segmentation in T2-weighted MRIs, which plays an important role in prostate cancer diagnosis and treatment planning. Our contributions are three-fold: We (1) introduce \textbf{continual prompting of a foundation model} for medical image segmentation that (2) successfully performs \textbf{domain adaptation} while achieving (3) \textbf{superior performance to} the Lifelong nnU-Net framework.

% \textit{\method{} successfully addresses domain adaptation} challenges that naturally occur over time while only having a sparse availability of data. It also opens new ways to leverage the unique capabilities of SAM in the context of dynamic clinical setups.

\section{Methodology}
\paragraph{Fundamentals}
We start by introducing some key terminology: $\Omega \subset \mathbb{R}^3$ defines a 3D spatial domain as we work with three-dimensional Magnetic Resonance (MR) scans. 
$\mathcal{T}_i \in \Omega_{\mathcal{T}}$ refers to a single task $i$, whereas $\Omega_{\mathcal{T}}$ represents the set of all tasks. A stage $j$ in a continual setup denotes the process of training the model on task $\mathcal{T}_j$ after it has been trained on all previous tasks $\{\mathcal{T}_1, \dots, \mathcal{T}_{j-1}\}$ using some CL method.


\paragraph{Basic components of SAM}

SAM \cite{kirillov2023segment} consists of a Vision Transformer (ViT) \cite{dosovitskiy2020image} as its core feature extractor and a segmentation head in the form of a mask decoder, responsible for generating precise segmentation masks. Trained on a large database of general images, SAM has acquired a robust knowledge base that facilitates its adaptability across various domains, opening doors for continual setups.


The ViT feature extractor within SAM captures visual information from input images and encodes it as detailed embeddings. SAM's segmentation head processes these embeddings to produce segmentation masks that delineate regions of interest within the input images. Moreover, the segmentation head can make use of different prompts, such as 2D points and bounding boxes, to further guide the segmentation process. This combination of general-purpose pretraining and promptable decoding makes SAM particularly well-suited for continual setups where adaptability is paramount.


\paragraph{\method{}: Continual Prompting for Enhanced Adaptability}
\method{} builds upon the foundation of SAM by introducing a novel approach to enhance adaptability over time. Since SAM generalizes to different domains, \method{} enhances this adaptability by introducing continual prompting using a ResNet-50 adapter. The architecture of \method{} is carefully crafted to leverage the global knowledge stored within the pre-trained SAM, offering adaptability in dynamic environments like healthcare. A key aspect of our methodology is the decision to keep SAM's ViT backbone frozen, ensuring consistent feature extraction across different datasets and imaging modalities. This approach not only enhances feature extraction reliability but also lays the groundwork for seamless adaptation to changing domain characteristics, see Figure \ref{fig:sam_baseline}. By leveraging the frozen ViT backbone, we implement a pre-processing step to extract embeddings from both training and testing sets, aligning with the approach proposed by MedSAM \cite{ma2023segment}.

\begin{figure}[htp]
    \centering
    \includegraphics[trim=0 5cm 6cm 0, clip, width=\textwidth]{images/sam_architecture.pdf}
    \caption{\method{} model for medical image segmentation using a ResNet-50 as an Adapter to build the input prompt for the base SAM backbone.}
    \label{fig:sam_baseline}
\end{figure}

A key component of \method{} lies in the continuous adaptation of the ResNet-50 adapter, which plays a crucial role in guiding the segmentation process by generating adaptive prompts. These adaptive prompts are based on the embeddings extracted by the ViT feature extractor. By incorporating a transposed convolutional layer, the ResNet-50 adapter effectively translates the embeddings into actionable prompts, improving segmentation accuracy. Based on the extensive ablations of SAM in radiology \cite{ranem2023exploring}, the proposed adapter is designed to predict 100 2D points and four coordinates representing a bounding box.
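
The data flow described above can be summarised in symbols. The notation below ($f_{\text{ViT}}$, $g_{\theta}$, $h_{\phi}$, $E$, $P$, $b$) is introduced here purely for illustration and is an assumption rather than part of the original formulation; $h_{\phi}$ is taken to denote SAM's mask decoder together with its prompt handling:
\begin{align*}
E &= f_{\text{ViT}}(x) && \text{(frozen ViT encoder, pre-computed)}\\
(P, b) &= g_{\theta}(E), \quad P \in \mathbb{R}^{100 \times 2},\; b \in \mathbb{R}^{4} && \text{(ResNet-50 adapter)}\\
\hat{y} &= h_{\phi}(E, P, b) && \text{(segmentation head)}
\end{align*}
Under this reading, the adapter emits $100 \cdot 2 + 4 = 204$ scalar prompt values per input, and only $\theta$ and $\phi$ are updated when a new task arrives, while $f_{\text{ViT}}$ stays fixed.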

Moreover, \method{} is designed to adapt to changing domain characteristics over time. While SAM's basic components provide a strong foundation for segmentation tasks, \method{}'s continual prompting mechanism ensures that the model can dynamically adjust to evolving datasets and domain shifts. This adaptability is essential for maintaining high performance across diverse medical imaging environments.

In summary, \textit{\method{}'s architectural composition integrates the robustness of pre-trained SAM and the feature extraction capabilities of ResNet-50}, forming a comprehensive base model for domain adaptation in medical image segmentation over time. By strategically freezing SAM's ViT backbone, coupled with the ResNet-50 Adapter, \method{} is the first method to effectively combine the strengths of both a foundation model and CL to achieve accurate and adaptable segmentation results for medical segmentation tasks.

\section{Experimental Setup}
% In this section, we provide a concise overview of our collection of publicly accessible datasets and outline key aspects of our experimental configuration.

\paragraph{Datasets}
We explore the problem of continual image segmentation for prostate MRIs. To ensure reproducibility, we use only openly available datasets, where each data source acts as one task in $\{\mathcal{T}_1, \dots, \mathcal{T}_{n}\}$. Table \ref{tab:data} provides a summary of the core characteristics of the data.

% We explore the problem of continual image segmentation for two different use cases. To ensure reproducibility, we use only openly available datasets. For each anatomy, we select an array of data sources that act as our tasks $\{\mathcal{T}_1, \dots, \mathcal{T}_{n}\}$. Table \ref{tab:data} provides a summary of core characteristics of the data.

% \begin{table}[htp]
% \centering
% \begin{adjustbox}{max width=\linewidth}{
% \begin{tabular}{ccccc|ccc}
% \hline
% \multirow{2}{*}{\textbf{Dataset}}        & \multicolumn{4}{c|}{\textbf{Prostate}}                        & \multicolumn{3}{c}{\textbf{Hippocampus}}                           \\ \cline{2-8} 
% & UCL               & I2CVB             & ISBI              & DecathProst       & HarP              & Dryad             & DecathHip         \\ \hline \hline
% \multicolumn{1}{l|}{\textbf{\# Cases}}   & 13                & 19                & 30                & 32                & 270               & 50                & 260               \\
% \multicolumn{1}{l|}{\textbf{Resolution}} & {[}24 384 384{]}  & {[}64 384 384{]}  & {[}19 384 384{]}  & {[}19 316 316{]}  & {[}48 64 64{]}    & {[}48 64 64{]}    & {[}36 50 35{]}    \\
% \multicolumn{1}{l|}{\textbf{Spacing}}    & {[}3.3 0.5 0.5{]} & {[}1.3 0.5 0.4{]} & {[}3.7 0.5 0.5{]} & {[}1.0 1.0 1.0{]} & {[}1.0 1.0 1.0{]} & {[}1.0 1.0 1.0{]} & {[}1.0 1.0 1.0{]}\\
% \bottomrule
% \end{tabular}}
% \end{adjustbox}
% \caption{Image and label characteristics of the used hippocampus and prostate datasets.}%; providing number of cases, mean resolution and spacing.}
% \label{tab:data}
% \end{table}

\begin{table}[htp]
\centering
\begin{adjustbox}{max width=\linewidth}{
\begin{tabular}{ccccc}
\hline
\multicolumn{1}{l}{Dataset} & UCL               & I2CVB             & ISBI              & DecathProst \\ \hline \hline
\multicolumn{1}{l|}{\# Cases}   & 13                & 19                & 30                & 32   \\
\multicolumn{1}{l|}{Resolution} & {[}24 384 384{]}  & {[}64 384 384{]}  & {[}19 384 384{]}  & {[}19 316 316{]}  \\
\multicolumn{1}{l|}{Spacing}    & {[}3.3 0.5 0.5{]} & {[}1.3 0.5 0.4{]} & {[}3.7 0.5 0.5{]} & {[}1.0 1.0 1.0{]} \\
\bottomrule
\end{tabular}}
\end{adjustbox}
\caption{Image and label characteristics of the used prostate datasets.}
\label{tab:data}
\end{table}

The prostate data corpus consists of four publicly available T2-weighted MRI datasets: sites A (ISBI), C (I2CVB) and D (UCL) as provided in the Multi-site Dataset for Prostate MRI Segmentation Challenge, and DecathProst from the Medical Segmentation Decathlon \cite{litjens2014evaluation, NCI-ISBI, lemaitre2015computer, liu2020saml, liu2020ms, antonelli2021medical}. For all datasets, we randomly hold out $20\%$ of the data for testing and maintain this split across all experiments.

% The hippcampus data corpus consists of three publicly available T1-weighted MRI datasets containing senior healthy subjects, patients with Alzheimer’s disease and schizophrenia patients; Harmonized Hippocampal Protocol data \cite{boccardi2015training} (HarP), Dryad \cite{kulaga2015multi} and DecathHip from the Medical Segmentation Decathlon \cite{antonelli2021medical}.
% We select these two problem settings to ensure variability across modality, shape and size of the segmentation masks and difficulty of the task at hand. 
% For all datasets, we randomly divide $20\%$ of the data for test purposes and maintain this split across all experiments.


\paragraph{Training setup}
All nnU-Net \cite{isensee2021nnu} experiments train for 250 epochs with 250 steps each using the Lifelong nnU-Net framework \cite{gonzalez2023lifelong} with default optimizer and scheduler. SAM experiments also run for up to 250 epochs, with early stopping (patience of 15), using the Adam optimizer with a weight decay of $10^{-4}$ and learning rates of $10^{-4}$ and $10^{-3}$ for the SAM segmentation head and the Adapter, respectively. Our loss function combines Dice-Cross-Entropy (DCE, $\mathcal{L}_{DCE}$) from \cite{the_monai_consortium_2020_4323059} and Mean-Squared-Error (MSE, $\mathcal{L}_{MSE}$). The MSE loss term is applied to the predicted point samples ($\mathcal{L}_{MSE}^{samples}$) and the bounding box coordinates ($\mathcal{L}_{MSE}^{BBox}$): $\mathcal{L}_{SAM} = \mathcal{L}_{DCE} + \mathcal{L}_{MSE}^{samples} + \mathcal{L}_{MSE}^{BBox}$.
% Each model trains individually for each task in the dataset to assess generalizability across tasks, crucial for continuous setups.
All models are trained on a single NVIDIA A40 GPU (48 GB).
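
To make the two MSE terms of $\mathcal{L}_{SAM}$ concrete, one plausible way to write them out is given below. This is a sketch under stated assumptions: the symbols $\hat{p}_k$, $p_k$, $\hat{b}$, $b$ and the mean reduction over all coordinates are introduced here for illustration, and how the reference points and box are derived from the ground-truth mask is not specified in this section.
\begin{align*}
\mathcal{L}_{MSE}^{samples} &= \frac{1}{2K} \sum_{k=1}^{K} \left\lVert \hat{p}_k - p_k \right\rVert_2^2, \qquad K = 100,\\
\mathcal{L}_{MSE}^{BBox} &= \frac{1}{4} \left\lVert \hat{b} - b \right\rVert_2^2,
\end{align*}
where $\hat{p}_k, p_k \in \mathbb{R}^2$ are the predicted and reference 2D points and $\hat{b}, b \in \mathbb{R}^4$ are the predicted and reference bounding box coordinates. All three terms of $\mathcal{L}_{SAM}$ enter with unit weight.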

\paragraph{Metrics}
For every CL setup, we report the mean Dice and standard deviation across the test images from all tasks $\{\mathcal{T}_i\}_{i \leq \lvert \Omega_{\mathcal{T}} \rvert}$ as well as the average forward (FWT) and backward (BWT) transferability \cite{diaz2018don}. FWT measures the impact of training on stage $\mathcal{T}_i$ on test data from a not-yet-trained stage $\mathcal{T}_j$, $j > i$. BWT, on the other hand, indicates how much knowledge on test samples from $\mathcal{T}_j$ is maintained while training on later stages $\mathcal{T}_i$, $j < i$. Models that achieve a higher FWT have high plasticity and are able to learn new knowledge, while models with a higher BWT maintain most knowledge from previous tasks, i.e., prevent catastrophic forgetting. More information on the CL metrics can be found in Appendix \ref{app:metrics}.
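
For reference, the Dice score between a predicted mask $\hat{Y}$ and a ground-truth mask $Y$ follows the standard definition (the symbols $\hat{Y}$ and $Y$ are introduced here for illustration):
\[
\text{Dice}(\hat{Y}, Y) = \frac{2\,\lvert \hat{Y} \cap Y \rvert}{\lvert \hat{Y} \rvert + \lvert Y \rvert}.
\]
With the definitions in Appendix \ref{app:metrics}, both transfer metrics are differences of Dice scores, so their sign is directly interpretable: a negative BWT means the final model scores lower on $\mathcal{T}_i$ than the model did right after training on $\mathcal{T}_i$ (forgetting), and a negative FWT means a model trained on the preceding tasks scores lower on the unseen $\mathcal{T}_i$ than a model trained on $\mathcal{T}_i$ alone. As a purely hypothetical example with made-up numbers, a Dice of $90\%$ directly after stage $i$ that drops to $75\%$ after the final stage gives $\text{BWT}(\mathcal{T}_i) = 75 - 90 = -15$.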


\paragraph{Baselines}
For a thorough evaluation of our approach, we compare against conventional sequential training, rehearsal training, and two well-known CL methods: EWC \cite{kirkpatrick2017overcoming} and RWalk \cite{chaudhry2018riemannian}. For both CL methods, we base our hyperparameters on the Lifelong nnU-Net \cite{gonzalez2023lifelong} setup in all our experiments (EWC: $\lambda = 0.4$, RWalk: $\alpha = 0.9, \lambda = 0.4$). %Rehearsal methods store and interleave samples from earlier tasks to have a representation of the trained domains over time to maintain proper performance over time.
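
To clarify the role of $\lambda$, recall that EWC in its original formulation \cite{kirkpatrick2017overcoming} adds a quadratic penalty to the loss of the current task $\mathcal{T}_j$. The form below is a sketch of that formulation; whether the implementation used here keeps the factor $\tfrac{1}{2}$, and how the importance weights are accumulated over more than two stages, are assumptions:
\[
\mathcal{L}_{EWC}(\theta) = \mathcal{L}_{\mathcal{T}_j}(\theta) + \frac{\lambda}{2} \sum_{k} F_k \left( \theta_k - \theta^{*}_{k} \right)^2,
\]
where $\theta^{*}$ denotes the parameters after the previous stage and $F_k$ is the diagonal Fisher information estimate for parameter $k$. A larger $\lambda$ makes the model more rigid, a smaller one more plastic. RWalk \cite{chaudhry2018riemannian} uses a penalty of the same shape but combines the Fisher-based importance with a path-based importance score; in its original formulation, $\alpha$ controls the moving average used to update the Fisher estimate.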


\section{Results}

\subsection{Continual learning performance}
\label{sec:cl_res}
In this section, we compare \method{} with sequential Lifelong nnU-Net training and two established CL methods -- EWC and RWalk. Additionally, we evaluate against rehearsal training, which stores a random $20\%$ of the samples from each task. Rehearsal serves as an upper bound for Lifelong nnU-Net but is impractical due to privacy policy constraints on storing patient images.

\begin{table}[htb]
\centering
\begin{adjustbox}{max width=\linewidth}
{\begin{tabular}{lccccc|cc}
\toprule
Method & Fixed param & Tuned param & Dice $\uparrow$ [\%] & BWT $\uparrow$ [\%] & FWT $\uparrow$ [\%] & \# Epochs $\downarrow$ & Runtime $\downarrow$ [sec]\\ \midrule \midrule
$\text{Sequential}_{\text{nnU-Net}}$ & \multirow{2}{*}{--} & \multirow{2}{*}{--} & $49.44 \pm 28.82$ & $-52.96 \pm 15.07$ & $-52.81 \pm 5.05$ & 1000 & 193 \\
$\text{Sequential}_{\text{\method{}}}$ & & & $\mathbf{78.38 \pm 11.67}$ & $\mathbf{-14.27 \pm 8.91}$ & $\mathbf{-21.27 \pm 10.96}$ & \textbf{113} & \textbf{43} \\
\midrule
$\text{EWC}_{\text{nnU-Net}}$ & \multirow{2}{*}{--} & \multirow{2}{*}{$\lambda = 0.4$} & $39.34 \pm 32.03$ & $-46.77 \pm 12.16$ & $-52.72 \pm 16.90$  & 1000 & 200 \\
$\text{EWC}_{\text{\method{}}}$ & & & $77.77 \pm 12.16$ & $-16.85 \pm 10.13$ & $-22.40 \pm 10.86$ & 123 & 46 \\
\midrule
$\text{RWalk}_{\text{nnU-Net}}$ & \multirow{2}{*}{\shortstack{$\alpha = 0.9$}} & \multirow{2}{*}{$\lambda = 0.4$} & $52.48 \pm 26.19$ & $-48.62 \pm 13.42$ & $-48.73 \pm 9.52$ & 1000 & 196\\
$\text{RWalk}_{\text{\method{}}}$ & & & $77.31 \pm 13.42$ & $-16.08 \pm 12.86$ & $-23.17 \pm 11.41$ & 120 & 48 \\
\midrule
$\text{Rehearsal}_{\text{nnU-Net}}$ & -- & -- & $60.90 \pm 21.62$ & $-37.45 \pm 11.60$ & $-39.83 \pm 7.92$ & 1000 & 269 \\
\bottomrule
\end{tabular}}
\end{adjustbox}
\caption{CL performance of the final model; mean Dice, BWT and FWT over all tasks including standard deviation, total amount of trained epochs and average runtime per epoch in seconds; best values are marked in bold.}
\label{tab:ps}
\end{table}

Table \ref{tab:ps} and Figure \ref{fig:spiders} show that Lifelong nnU-Net gains some benefit from certain CL methods, but is substantially outperformed by \method{} in every configuration.
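
The training cost implied by Table \ref{tab:ps} can be estimated by multiplying the total number of epochs by the average runtime per epoch. The calculation below is a rough sketch for the two sequential runs; it assumes that the runtime column is an average over all trained epochs and it does not account for the one-off extraction of ViT embeddings, which may or may not be included in the reported runtime:
\begin{align*}
\text{nnU-Net:} \quad & 1000 \times 193\,\text{s} = 193\,000\,\text{s} \approx 53.6\,\text{h},\\
\text{\method{}:} \quad & 113 \times 43\,\text{s} = 4\,859\,\text{s} \approx 1.35\,\text{h}.
\end{align*}
Under these assumptions, the sequential \method{} run needs roughly a factor of $193\,000 / 4\,859 \approx 40$ less training time.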

\begin{figure}[htp]
    \centering
    \includegraphics[trim=0 5cm 7cm 0, clip, width=0.95\textwidth]{images/spiders.pdf}
    \caption{Segmentation performance as Dice using different CL methods for \method{} and Lifelong nnU-Net; the larger the area the better.}
    \label{fig:spiders}
\end{figure}

\method{} demonstrates superiority over Lifelong nnU-Net as it \textit{successfully leverages the rich knowledge base embedded in the foundation model}, enabling robust adaptation to domain shifts within the data. The gains of roughly $23$ and $18$ percentage points in BWT and FWT over the rehearsal upper bound stand in contrast to traditional methods, which \textit{struggle to handle domain variations effectively} over time.
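
The gaps quoted above can be checked directly against Table \ref{tab:ps}, assuming they compare the sequential \method{} row with the rehearsal nnU-Net row (this pairing is an inference; it is the one that reproduces the quoted figures):
\begin{align*}
\Delta \text{BWT} &= -14.27 - (-37.45) = 23.18,\\
\Delta \text{FWT} &= -21.27 - (-39.83) = 18.56,\\
\Delta \text{Dice} &= 78.38 - 60.90 = 17.48.
\end{align*}
All three differences are absolute percentage points, not relative improvements.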
% The performance increase of $23\%$ and $18\%$ for BWT and FWT correspondingly compared to the upper bound contrasts with the limitations observed in traditional methods, including advanced frameworks like the Lifelong nnU-Net, which tend to fail in effectively managing domain variations over time for medical tasks.

\subsection{Qualitative temporal evaluation}
To analyze the robustness of our proposed method, we illustrate segmentation masks in Figure \ref{fig:temp_res} for \method{} and Lifelong nnU-Net using EWC, RWalk and rehearsal.

\method{} consistently generates coherent segmentation masks throughout all training stages. In contrast, EWC and rehearsal training for Lifelong nnU-Net result in low-quality segmentations after training on the final stage $\mathcal{T}_{4}$. The reduced performance on the sample scan at later stages illustrates the impact of catastrophic forgetting, where the network adapts excessively to the most recent training data, i.e., is too plastic. \method{} avoids being either overly rigid or overly plastic, providing robust predictions that maintain quality across both early and later training stages.

\begin{figure}[htp!]
    \centering
    \includegraphics[trim=0 1.5cm 8cm 0.35cm, clip, width=0.95\textwidth]{images/qual_fig.pdf}
    \caption{Temporal analysis for sequential, EWC, Rehearsal and \method{} using Case 14, Slice 14 (37) from $\mathcal{T}_2$.}
    \label{fig:temp_res}
\end{figure}

\subsection{\method{} static comparison}
In a static training condition, a model is trained on one single task and validated across all existing tasks, as shown in Table \ref{tab:baselines}. Directly comparing \method{} and nnU-Net under static training provides insights into each method's generalizability and ability to handle diverse tasks after training on a single dataset.

\begin{table}[htp]
\begin{center}
\begin{adjustbox}{max width=\linewidth}
{\begin{tabular}{ccccccc}
\toprule
& & \multirow{2}{*}{Trained on} & \multicolumn{4}{c}{Tested on -- Dice $\uparrow{ } \pm{ } $ $\sigma \downarrow $ {[}\%{]}} \\ \cmidrule{4-7}
& & & UCL & I2CVB & ISBI & DecathProst \\ \midrule \midrule
\parbox[t]{2mm}{\multirow{4}{*}{\rotatebox[origin=c]{90}{nnU-Net}}} & & UCL & $\mathbf{85.47 \pm 6.92}$ & $23.24 \pm 16.8$ & $81.47 \pm 10.7$ & $9.68 \pm 11.4$ \\
& & I2CVB & $57.11 \pm 7.57$ & $\mathbf{83.06 \pm 0.28}$ & $45.73 \pm 20.2$ & $1.30 \pm 1.53$ \\
& & ISBI & $81.78 \pm 6.15$ & $29.06 \pm 17.6$ & $\mathbf{93.00 \pm 1.46}$ & $52.48 \pm 27.5$ \\
& & DecathProst & $25.24 \pm 25.1$ & $27.57 \pm 1.89$ & $59.20 \pm 16.3$ & $\mathbf{89.25 \pm 1.78}$ \\ \cmidrule{1-7}
\parbox[t]{2mm}{\multirow{4}{*}{\rotatebox[origin=c]{90}{\shortstack{UnCLe \\ SAM}}}} & & UCL & $\mathbf{85.29 \pm 3.59}$ & $51.53 \pm 36.1$ & $85.57 \pm 6.61$ & $44.57 \pm 21.8$ \\
& & I2CVB & $85.99 \pm 1.83$ & $\mathbf{88.11 \pm 3.20}$ & $83.55 \pm 8.42$ & $80.16 \pm 6.52$ \\
& & ISBI & $84.68 \pm 0.93$ & $54.84 \pm 38.8$ & $\mathbf{96.02 \pm 1.04}$ & $82.56 \pm 3.79$ \\
& & DecathProst & $81.55 \pm 8.66$ & $59.94 \pm 36.4$ & $79.77 \pm 15.1$ & $\mathbf{92.27 \pm 0.97}$ \\
\bottomrule
\end{tabular}}
\end{adjustbox}
\caption{Results for nnU-Net and \method{} networks trained on every task individually and evaluated across all tasks; Bold values indicate the performance of the baseline on the validation set of the task it has been trained on.}
\label{tab:baselines}
\end{center}
\end{table}

Table \ref{tab:baselines} demonstrates the \textit{superior performance and greater generalizability of \method{} compared to nnU-Net} for prostate segmentation. \method{} achieves higher Dice scores in 15 of the 16 train/test combinations, the only exception being training and testing on UCL ($85.29\%$ vs. $85.47\%$), which highlights its robustness and adaptability. It showcases promising potential for domain adaptation and continual learning in medical image segmentation, providing a stable and adaptable solution across diverse datasets. For additional results, we refer the reader to Appendix \ref{app:base_models}.

\section{Conclusion}
We propose \textit{\method{}}, a \textit{novel approach} that leverages the knowledge base of the pre-trained SAM \textit{foundation model} to address domain adaptation challenges in \textit{continual medical image segmentation}. By leveraging SAM's robust capabilities, our method achieves \textit{superior adaptability and performance} compared to traditional U-Net architectures like the Lifelong nnU-Net framework. Through extensive evaluation on a set of four different prostate datasets, \method{} demonstrates its effectiveness in maintaining knowledge from early stages while adapting to evolving datasets over time. Our approach not only \textit{outperforms existing methods} in terms of segmentation accuracy in a continual setup but also offers a more generalizable solution, showcasing a significant performance improvement even when trained statically with data from a single site. \method{} paves the way for a \textit{balanced approach between rigidity and plasticity} in continual learning setups without using actual CL methods like EWC, while achieving better results than rehearsal. By releasing our code base, we hope to inspire research in CL that goes beyond traditional U-Net-based segmentation for medical settings by leveraging the knowledge base of foundation models like SAM.

% \section{Reproducibility}
% All datasets used in this manuscript are publicly available under the corresponding citations. Trained networks and instructions on how to run all experiments, will be made publicly available under \url{https://github.com/anon}.

% Acknowledgments -- Will not appear in anonymized version
% \midlacknowledgments{Put some Acknowledgments here if needed.}

\clearpage

\bibliography{midl24_028}

\clearpage

\appendix

\section{Base model performance}
\label{app:base_models}
Table \ref{tab:baselines} from the main manuscript provides the Dice scores with standard deviation for every trained baseline evaluated across all tasks. Figure \ref{fig:conf_matrix} visualizes them in the form of confusion matrices.

\begin{figure}[htp]
    \centering
    \includegraphics[trim=0 3cm 0 0, clip, width=\textwidth]{images/confusion_matrix.pdf}
    \caption{Confusion matrices based on Dice score for \method{} (left) and nnU-Net (right) across different datasets.}
    \label{fig:conf_matrix}
\end{figure}

\section{Continual learning metrics}
\label{app:metrics}
In this work, backward transfer (BWT) and forward transfer (FWT) are defined following \cite{diaz2018don}. Let $\mathcal{T}_{i}$ denote the $i$-th task in the training sequence.


FWT is defined as
\begin{align}
\label{eqn:F}
    \text{FWT}\left( \mathcal{T}_{i}\right) &= \text{Dice}\left(\mathcal{M}_{\left[ \mathcal{T}_{1}, \dots, \mathcal{T}_{i-1}\right]}, \mathcal{T}_{i}\right) - \text{Dice}\left(\mathcal{M}_{\left[\mathcal{T}_{i}\right]}, \mathcal{T}_{i}\right),
\end{align}

where $\mathcal{M}_{\left[ \mathcal{T}_{1}, \dots, \mathcal{T}_{i}\right]}$ is a network trained on stages $\{1, \dots, i\}$ with $i \leq \lvert \Omega_{\mathcal{T}} \rvert$, and $\text{Dice}(\mathcal{M}_{\left[ \mathcal{T}_{1}, \dots, \mathcal{T}_{j}\right]}, \mathcal{T}_{i})$ indicates the S\o{}rensen--Dice coefficient of a network trained on stages $\{1, \dots, j\}$ evaluated on dataset $i$. Accordingly, $\mathcal{M}_{\left[\mathcal{T}_{i}\right]}$ is a network trained on $\mathcal{T}_{i}$ alone.

BWT is defined as
\begin{align}
\label{eqn:B}
    \text{BWT}\left( \mathcal{T}_{i}\right) &=
    \text{Dice}\left(\mathcal{M}_{\left[ \mathcal{T}_{1}, \dots, \mathcal{T}_{i}, \dots, \mathcal{T}_{n}\right]}, \mathcal{T}_{i}\right) - \text{Dice}\left(\mathcal{M}_{\left[ \mathcal{T}_{1}, \dots, \mathcal{T}_{i}\right]}, \mathcal{T}_{i}\right).
\end{align}

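Here, $n = \lvert \Omega_{\mathcal{T}} \rvert$ denotes the total number of tasks. FWT is not defined for the last model state, which has no subsequent task to be evaluated on, and BWT is not defined for the first model state, which has no previous task. In terms of tasks, Equation~\ref{eqn:F} has no preceding model for $\mathcal{T}_{1}$, and Equation~\ref{eqn:B} is trivially zero for $\mathcal{T}_{n}$.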
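
As a brief worked illustration of Equations~\ref{eqn:F} and~\ref{eqn:B}, consider a purely hypothetical sequence of two tasks, $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$. The Dice values below are invented for illustration only and are not results from this work. Suppose a network trained only on $\mathcal{T}_{1}$ reaches $90\%$ Dice on $\mathcal{T}_{1}$ and $60\%$ on the unseen $\mathcal{T}_{2}$, a network trained only on $\mathcal{T}_{2}$ reaches $85\%$ on $\mathcal{T}_{2}$, and the network trained on both stages reaches $80\%$ on $\mathcal{T}_{1}$. Then
\begin{align*}
    \text{FWT}\left( \mathcal{T}_{2}\right) &= \text{Dice}\left(\mathcal{M}_{\left[ \mathcal{T}_{1}\right]}, \mathcal{T}_{2}\right) - \text{Dice}\left(\mathcal{M}_{\left[\mathcal{T}_{2}\right]}, \mathcal{T}_{2}\right) = 60\% - 85\% = -25\%, \\
    \text{BWT}\left( \mathcal{T}_{1}\right) &= \text{Dice}\left(\mathcal{M}_{\left[ \mathcal{T}_{1}, \mathcal{T}_{2}\right]}, \mathcal{T}_{1}\right) - \text{Dice}\left(\mathcal{M}_{\left[ \mathcal{T}_{1}\right]}, \mathcal{T}_{1}\right) = 80\% - 90\% = -10\%,
\end{align*}
where both differences are in percentage points. In this reading, a negative BWT indicates that performance on an earlier task dropped after training on later stages (forgetting), whereas an FWT close to zero indicates that a network trained only on previous stages performs nearly as well on a new task as a network trained directly on that task.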

\section{Continual learning performance}
Table~\ref{tab:ps} from the main manuscript provides the CL performance of all methods in terms of mean Dice, BWT, and FWT. Table~\ref{tab:cl_baselines} provides the Dice scores and standard deviations of each method across all four tasks after the network was trained on all stages in a continual manner.



\begin{table}[htp]
\begin{center}
\begin{adjustbox}{max width=\linewidth}
{\begin{tabular}{lccccc}
\toprule
\multirow{2}{*}{Method}  & \multicolumn{5}{c}{Tested on -- Dice $\uparrow$ $\pm$ $\sigma$ $\downarrow$ {[}\%{]}} \\ \cmidrule{3-6}
& & UCL & I2CVB & ISBI & DecathProst \\ \midrule
$\text{Sequential}_{\text{nnU-Net}}$ &  & $25.06 \pm 5.61$ & $20.22 \pm 3.95$ & $61.40 \pm 15.7$ & $91.06 \pm 1.61$ \\
$\text{Sequential}_{\text{\method{}}}$ &  & $81.55 \pm 8.66$ & $\mathbf{59.94 \pm 36.4}$ & $79.77 \pm 15.1$ & $\mathbf{92.27 \pm 0.97}$ \\
\hline
$\text{EWC}_{\text{nnU-Net}}$ &  & $21.98 \pm 3.92$ & $2.17 \pm 2.57$ & $45.01 \pm 24.9$ & $88.20 \pm 1.30$ \\
$\text{EWC}_{\text{\method{}}}$ &  & $78.36 \pm 8.57$ & $58.26 \pm 36.5$ & $83.19 \pm 10.7$ & $91.25 \pm 1.01$ \\
\hline
$\text{RWalk}_{\text{nnU-Net}}$ &  & $36.86 \pm 7.82$ & $21.67 \pm 4.56$ & $60.33 \pm 18.9$ & $91.07 \pm 1.54$ \\
$\text{RWalk}_{\text{\method{}}}$ &  & $\mathbf{81.56 \pm 4.95}$ & $54.51 \pm 38.5$ & $\mathbf{84.34 \pm 5.47}$ & $88.85 \pm 3.20$ \\
\hline
$\text{Rehearsal}_{\text{nnU-Net}}$ &  & $35.39 \pm 30.7$ & $46.59 \pm 3.03$ & $70.30 \pm 16.00$ & $91.33 \pm 1.45$ \\
\bottomrule
\end{tabular}}
\end{adjustbox}
\caption{Results for the final nnU-Net and \method{} networks after training on all tasks sequentially, with EWC, and with RWalk, as well as for nnU-Net with Rehearsal, evaluated across all tasks. Bold values indicate the best performance.}
\label{tab:cl_baselines}
\end{center}
\end{table}

\section{Workflow comparison of models}
\begin{figure}[htp]
    \centering
    \includegraphics[trim=0 4cm 13cm 0, clip, width=0.8\textwidth]{images/sam_medsam.pdf}
    \caption{Comparison of workflow methodologies for SAM, MedSAM, and \method{}. SAM and MedSAM adopt a centralized training approach, and SAM fails to perform well in medical use cases. In contrast, \method{} utilizes a continual training paradigm, facilitating CL.}
    \label{fig:com}
\end{figure}
Figure~\ref{fig:com} illustrates the workflow for SAM, MedSAM, and \method{}, highlighting their distinct training methodologies. SAM and MedSAM are trained from scratch in a centralized manner, which entails increased CO$_2$ emissions. Additionally, the centralized training data of MedSAM may inadvertently contain samples from publicly used datasets, resulting in data leakage and privacy violations in CL comparisons. In contrast, \method{} utilizes a continual training paradigm that reduces CO$_2$ emissions by leveraging the pre-trained SAM model and enables adaptation to evolving data.

% \let\clearpage\relax
\end{document}