%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experiments}
\subsection{Datasets, Experimental Setup, Evaluation Metrics, CL Benchmarks}
\noindent \textbf{Datasets:} We evaluate and compare CLMU-Net on five heterogeneous 3D brain MRI datasets: BRATS-Decathlon~\cite{bakas2017advancing}, ISLES~\cite{maier2017isles}, MSSEG~\cite{commowick2018objective}, ATLAS~\cite{liew2022large}, and WMH~\cite{AECRSD_2022}, covering a wide range of modalities (T1, T1c, T2, PD, DWI, FLAIR), lesion types (tumor, stroke, sclerosis, and white matter hyperintensities), and acquisition centers. Each dataset is treated as a separate episode, arriving sequentially in a domain-incremental CL setting. We evaluate on two dataset sequences: $S1$ (BRATS-Decathlon, ATLAS, MSSEG, ISLES, WMH), ordered from largest to smallest dataset size, and $S2$ (MSSEG, BRATS-Decathlon, ISLES, WMH, ATLAS), ordered by decreasing number of modalities~\cite{sadegheih2025modality}. We follow the train-test split of~\citet{sadegheih2025modality}.

\noindent \textbf{Experimental Setup:} All volumes are resampled to a common resolution ($1$\,mm), skull-stripped, and z-score normalized per modality~\cite{sadegheih2025modality,xu2024feasibility}. During training, we use a patch-wise sampling strategy with patch size $128^3$ and batch size $2$. Optimization is performed using Adam with an initial learning rate of $1\times10^{-3}$. The model is trained for $300$ epochs on each dataset before moving to the next training session. We evaluate CLMU-Net and the best-performing buffer-based method (ER) for buffer sizes $\beta \in \{10, 20, 30, 40\}$. Both $\alpha$ and $\gamma$ are set to $0.9$. All experiments are implemented in PyTorch~2.5 and run on a single NVIDIA H100 GPU with $92$\,GB memory; training a full task sequence requires approximately $41$ GPU hours.

\noindent \textbf{Evaluation Metrics:} We report the standard volumetric segmentation metric, the Dice Similarity Coefficient (DSC), to evaluate model performance across tasks. To evaluate retention across sequential tasks, we adopt CL metrics widely used in previous literature~\cite{kumari2025continual,sadegheih2025modality}: average performance (AVG), incremental learning metric (ILM), and backward transfer (BWT)~\cite{lopez2017gradient}. Negative BWT reflects forgetting of earlier tasks, while positive values indicate knowledge retention or improvement. For all metrics, higher values indicate better performance.
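For concreteness, these metrics can be written as follows. This is a sketch under the assumption that the standard definitions from the cited literature are used; the notation $D_{i,j}$ (DSC on dataset $j$ after training session $i$) and $T$ (number of training sessions) is introduced here only for illustration.
\[
\mathrm{AVG} = \frac{1}{T}\sum_{j=1}^{T} D_{T,j}, \qquad
\mathrm{ILM} = \frac{1}{T}\sum_{i=1}^{T}\frac{1}{i}\sum_{j=1}^{i} D_{i,j},
\]
\[
\mathrm{BWT} = \frac{1}{T-1}\sum_{j=1}^{T-1}\left(D_{T,j} - D_{j,j}\right).
\]
Under these assumed definitions, AVG depends only on the final session, ILM averages over every session, and BWT is negative whenever the final model performs worse on a dataset than it did right after learning it.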



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent \textbf{CL Benchmarks:} We benchmark CLMU-Net against several representative CL strategies. The lower bound (LB) is given by the `naive' and `fromScratchTraining' settings, and the upper bound (UB) by the `cumulative' and `joint' settings. We consider frequently benchmarked buffer-free and buffer-based CL methods, including those covered in `Lifelong nnU-Net'~\cite{gonzalez2023lifelong} (a recent benchmark framework for medical CL).
Among buffer-free methods, we consider LFL \cite{jung2016less}, MAS \cite{aljundi2018memory}, EWC \cite{alvarez2025mitigating}, SI \cite{zenke2017continual}, LwF \cite{li2017learning}, MiB~\cite{cermelli2020modeling}, TED~\cite{zhu2024boosting}, and BrainCL \cite{sadegheih2025modality}.
For buffer-based approaches, we consider GEM~\cite{lopez2017gradient}, MIR~\cite{aljundi2019online}, GDumb~\cite{prabhu2020gdumb}, ER~\cite{rolnick2019experience}, and RCLP~\cite{ceccon2025multi}. Following the setup of BrainCL~\cite{sadegheih2025modality}, we re-implement all methods within the same training pipeline to ensure a fair comparison.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%================================
\begin{figure}[!t]
    \centering
    \includegraphics[width=1\linewidth]{IMAGES/buffer_comparison.pdf}
    \caption{ER (dashed) vs. CLMU-Net (solid) across buffer sizes $\beta$ for sequences $S1$ and $S2$ (left: AVG; right: ILM).}
    \label{fig:buffer_comparison}
\end{figure}
%================================

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Results}
\noindent \textbf{Comparison with other methods:} 
Table~\ref{tab:resultsTableMain} reports the performance of LB, UB, CLMU-Net, and representative CL methods across the five datasets in sequences $S1$, $S2$, and their mean. AVG and ILM summarize segmentation accuracy, while BWT quantifies forgetting. LB and UB are non-CL references included only for contextual comparison. BWT alone can be misleading because it reflects only the change between the current and final session; for example, GDumb shows relatively strong BWT in $S1$ yet yields substantially lower AVG and ILM than other methods. AVG measures only the final-session DSC, whereas ILM captures the mean DSC over all sessions, offering a more complete view of stability. We therefore interpret the three metrics jointly and base our conclusions on all of them.

Buffer-free methods show pronounced degradation, with severe forgetting and substantially lower ILM and AVG values. CLMU-Net clearly outperforms the strongest buffer-free baseline (BrainCL), improving \{AVG, ILM, BWT\} by \{16.28\%, 14.06\%, 40.28\%\} in $S1$ and \{67.93\%, 32.38\%, 68.37\%\} in $S2$, despite using only ten stored past samples. This highlights the role of replay in mitigating forgetting in complex brain lesion segmentation. Among buffer-based approaches, CLMU-Net achieves the best AVG and ILM and the least forgetting. Relative to the strongest baseline in this category (ER), CLMU-Net improves \{AVG, ILM, BWT\} by \{27.42\%, 9.95\%, 45.93\%\} in $S1$ and \{10.36\%, 15.36\%, 64.22\%\} in $S2$.
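The percentages above can be read as relative changes with respect to the baseline. As a sketch, assuming this convention (the symbols $m_{\mathrm{ours}}$ and $m_{\mathrm{base}}$ are illustrative and denote the value of a metric $m \in \{\mathrm{AVG}, \mathrm{ILM}, \mathrm{BWT}\}$ for CLMU-Net and for the baseline, respectively):
\[
\Delta_m = 100 \times \frac{m_{\mathrm{ours}} - m_{\mathrm{base}}}{\left|m_{\mathrm{base}}\right|}\,\%.
\]
The absolute value in the denominator keeps the sign meaningful for BWT, which is typically negative: a less negative BWT then yields a positive $\Delta_m$.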

When textual guidance (DCTG) and input-layer inflation (ILI) are combined with replay, the hybrid variant outperforms using either component alone, indicating that these modules provide complementary benefits. The top two AVG and ILM scores (\textcolor{red}{red} and \textcolor{blue}{blue} in Table~\ref{tab:resultsTableMain}) across all CL methods are achieved by CLMU-Net variants, reflecting the strength of this design.

Overall, the lesion-aware buffer strategy in CLMU-Net provides consistent gains across $S1$, $S2$, and their mean, surpassing both buffer-free and buffer-based baselines and demonstrating the advantage of coupling targeted replay with a modality-flexible architecture and global textual guidance.



\noindent \textbf{Comparison with different buffer sizes:} 
Fig.~\ref{fig:buffer_comparison} compares CLMU-Net with the strongest baseline, ER, by reporting AVG and ILM across buffer sizes $\beta$ and sequences $S1$ and $S2$.
Both methods improve as $\beta$ increases, yet CLMU-Net consistently surpasses ER, with the largest gains appearing in the low-buffer regime ($\beta \leq 20$). 
Using the mean performance over $S1$ and $S2$ (green curves in Fig.~\ref{fig:buffer_comparison}), the relative gains of our method over ER across $\beta \in \{10, 20, 30, 40\}$ are \{21.51\%, 14.28\%, 9.15\%, 11.35\%\} for AVG and \{11.50\%, 5.14\%, 4.35\%, 1.66\%\} for ILM. These results suggest that the lesion-aware selection mechanism captures the underlying data distribution more effectively and provides more informative samples per memory budget, mitigating forgetting even at the smallest buffer sizes.
