%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[!t]
\centering
\includegraphics[scale=0.45]{IMAGES/CL_yousef-compact.pdf}
\caption{Overview of the CLMU-Net framework. 
}
\label{fig:proposedFramework}
\end{figure}


\section{Methodology}

Our goal is to develop a continual brain-lesion segmentation model that learns from sequentially arriving datasets, also referred to as episodes or tasks, while remaining robust to shifts in pathology, acquisition site, and modality availability. In this domain-incremental CL setup, the model encounters a sequence of $T$ episodes, $\{D_1, \dots, D_T\}$, arriving one after another. Our proposed \textbf{CLMU-Net} is built around three synergistic components (Fig.~\ref{fig:proposedFramework}). First, a modality-flexible input interface accommodates arbitrary and evolving modality sets through dynamic channel inflation and Random Modality Drop (RMD), supporting generalization and stable performance under heterogeneous imaging protocols. Second, a Domain-Conditioned Textual Guidance (DCTG) module injects global pathology and modality context at the U-Net bottleneck. Each 3D patch extracted from a patient's scan is paired with a language-model-derived domain embedding, generated from a prompt describing its lesion type and available modalities; this embedding interacts with the bottleneck features via cross-attention to produce globally informed representations. Third, a lesion-aware experience replay mechanism maintains long-term stability by storing a balanced mixture of representative and difficult samples from each dataset.
After completing the $t$-th training session with the current dataset ($D_t$) and the replay buffer carried over from earlier sessions ($\mathcal{B}^{global}_{t-1}$), each sample from $D_t$ is ranked according to these two complementary criteria (representativeness and difficulty), and the top-scoring ones are inserted into a fixed-size global buffer, so that future replay batches include both stable anchors of the distribution and challenging cases most susceptible to forgetting. Together, these components enable CLMU-Net to continually acquire new knowledge while preserving past performance, even under shifting modality configurations and non-stationary clinical data streams.


\begin{figure}[!t]
    \centering
    \includegraphics[scale=0.57]{IMAGES/dynamic3.pdf}
    \caption{Modality-flexible design: varying episode-wise modalities (top), channel inflation for new modalities (middle), and RMD for modality-agnostic training (bottom).
    }
    \label{fig:dynamicArch}
\end{figure}

\subsection{Buffer Selection Criteria}\label{sec:bufferSelection}
Experience replay plays a central role in CLMU-Net, and its effectiveness depends critically on which samples are retained in the replay buffer. Selecting buffer samples at random~\cite{rolnick2019experience} can lead to suboptimal use of the limited buffer capacity.
Instead, we employ a lesion-aware selection strategy that balances two complementary categories: \emph{representative} samples that anchor the dominant lesion distribution of past datasets, and \emph{difficult} samples that capture boundary ambiguity and morphological variability. This ensures that the buffer preserves both the stable core of each dataset and the challenging cases most susceptible to forgetting.

\paragraph{Representative samples:}
Representative samples are defined as cases for which the model produces confident lesion predictions and that contain sufficient lesion volume to reflect typical pathology. Such cases anchor the replay buffer to the dominant lesion distribution of each dataset, ensuring adequate coverage of common lesion patterns and supporting stable knowledge retention across datasets. We quantify representativeness using two complementary measures: lesion prediction confidence and lesion size.

For each 3D MRI volume $i$, the network outputs a voxel-wise softmax probability $\hat{p}^{(i)}(v) \in [0,1]$ for the lesion class at each voxel $v$, and the corresponding ground-truth annotation is denoted by $G^{(i)}(v) \in \{0,1\}$. Let $L^{(i)} = \{ v \mid G^{(i)}(v) = 1 \}$ be the set of lesion voxels and let $\tau = 0.5$ be a confidence threshold. To ensure that the confidence score reflects reliable predictions, only lesion voxels with sufficiently high predicted lesion probability contribute positively. Concretely, we define for each lesion voxel
$s^{(i)}(v) = \hat{p}^{(i)}(v)$ if $\hat{p}^{(i)}(v) > \tau$ and $s^{(i)}(v) = 0$ otherwise.
The sample-level confidence score is then given by
$S_{\text{conf}}^{(i)} = \frac{1}{|L^{(i)}|} \sum_{v \in L^{(i)}} s^{(i)}(v)$.
Higher values indicate that a larger fraction of lesion voxels is segmented with high confidence, suggesting that the model has internalized the lesion appearance well for that case. Lesion size provides a second signal of representativeness, as larger lesions offer richer and more diverse supervision. Volumes with larger lesions contribute more positive voxels and better capture the main structure of the pathology, reducing the risk that the buffer is dominated by cases with very small lesions or mostly background. For each sample $i$, we therefore define the lesion size score as $S_{\text{size}}^{(i)} = |L^{(i)}|$, that is, the number of lesion voxels in the volume. In practice, both scores are normalized across the dataset and combined into a single representativeness score $R_{\text{rep}}^{(i)} = (1 - \alpha)\,S_{\text{conf}}^{(i)} + \alpha\,S_{\text{size}}^{(i)}$, where the weighting factor $\alpha \in [0,1]$ balances the relative importance of prediction confidence and lesion volume when ranking samples for inclusion in the replay buffer.
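
As a purely illustrative calculation (the numbers are hypothetical and chosen only to make the definitions concrete), consider a volume with four lesion voxels whose predicted lesion probabilities are $0.9$, $0.8$, $0.4$, and $0.6$. With $\tau = 0.5$, the third voxel contributes zero, so
\[
S_{\text{conf}}^{(i)} = \frac{0.9 + 0.8 + 0 + 0.6}{4} = 0.575 .
\]
If, after normalization across the dataset, a sample had $S_{\text{conf}}^{(i)} = 0.6$ and $S_{\text{size}}^{(i)} = 0.2$, then an assumed weighting of $\alpha = 0.25$ would give
\[
R_{\text{rep}}^{(i)} = (1 - 0.25)\cdot 0.6 + 0.25 \cdot 0.2 = 0.5 ,
\]
whereas $\alpha = 0.75$ would give $0.25 \cdot 0.6 + 0.75 \cdot 0.2 = 0.3$, showing how a larger $\alpha$ penalizes a confidently segmented but small lesion.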

\paragraph{Difficult samples:}
Difficult cases emphasize boundary ambiguity and irregular morphologies, which are typically the first to degrade under distribution shift. Retaining such cases in the buffer ensures that the model remains exposed to challenging lesion structures, thereby improving robustness across sequential tasks. We quantify difficulty using two complementary measures: boundary uncertainty and lesion complexity. Boundary uncertainty characterizes how unstable the model's predictions are near the lesion margin. For each voxel $v$ of sample $i$, the network outputs a foreground probability $\hat{p}^{(i)}(v)$, and uncertainty is assessed by how close these probabilities lie to the decision threshold of $0.5$. Probabilities near this threshold indicate ambiguity, whereas probabilities far from it indicate confident separation of lesion and background. To focus on the most informative region, uncertainty is computed only within a symmetric 3D boundary band with a fixed total width of $9$ voxels, constructed by expanding the lesion surface by $4$ voxels inward and $4$ voxels outward. The sample-level boundary uncertainty score is then defined as
$
S_{\text{unc}}^{(i)} = \frac{1}{|B^{(i)}|}\sum_{v\in B^{(i)}} \bigl\lvert \hat{p}^{(i)}(v) - 0.5 \bigr\rvert ,
$
where $B^{(i)}$ denotes the boundary band for sample $i$. Smaller values indicate greater prediction instability along the lesion margin, making such cases more difficult and more susceptible to forgetting. Lesions with fragmented or irregular morphology also pose challenges for sequential learning. We quantify this property using the lesion complexity score \(S_{\text{comp}}^{(i)} = (C^{(i)})^2 / N^{(i)}\), where $C^{(i)}$ is the number of connected lesion components and $N^{(i)} = |L^{(i)}|$ is the number of lesion voxels. Higher values correspond to more irregular or scattered lesions. These two difficulty indicators are combined into a final difficulty score $R_{\text{diff}}^{(i)}$ using a weighting factor $\gamma\in[0,1]$, enabling the sample ranking to reflect both boundary ambiguity and morphological fragmentation.
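
To make the two indicators concrete, consider a hypothetical boundary band of four voxels. Predictions of $0.55$, $0.45$, $0.6$, and $0.4$ give $S_{\text{unc}}^{(i)} = (0.05 + 0.05 + 0.1 + 0.1)/4 = 0.075$, whereas confident predictions of $0.95$, $0.05$, $0.9$, and $0.1$ give $(0.45 + 0.45 + 0.4 + 0.4)/4 = 0.425$; the first case is the more ambiguous one. Similarly, for two hypothetical lesions of equal volume $N^{(i)} = 900$, a single connected component yields $S_{\text{comp}}^{(i)} = 1/900 \approx 0.0011$, while three components yield $9/900 = 0.01$. One possible way to combine the indicators, stated here as an assumption rather than as the exact form used, mirrors the representativeness score and inverts the normalized uncertainty so that higher always means harder:
\[
R_{\text{diff}}^{(i)} = (1-\gamma)\,\bigl(1 - \tilde{S}_{\text{unc}}^{(i)}\bigr) + \gamma\,\tilde{S}_{\text{comp}}^{(i)},
\]
where $\tilde{S}_{\text{unc}}^{(i)}$ and $\tilde{S}_{\text{comp}}^{(i)}$ are hypothetical symbols for the two scores after rescaling to $[0,1]$ across the dataset.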



\subsection{Final Buffer Composition and Management}

CLMU-Net maintains a global replay buffer which is sequentially updated. After completing the \(t\)-th training session, we compute \(R_{\text{rep}}\) and \(R_{\text{diff}}\) for all samples in \(D_t\) and select the top-ranked volumes in equal proportion from both categories to form the dataset-specific partition \(\mathcal{B}_t\). The global buffer is updated by inserting \(\mathcal{B}_t\) into the existing buffer that contains partitions from previously seen datasets. Formally, the global buffer after session \(t\) is
$
\mathcal{B}^{global}_t = \mathcal{B}^{global}_{t-1} \cup \mathcal{B}_t,
$
with \(\mathcal{B}^{global}_0 = \varnothing\). The buffer has a fixed capacity \(\beta\), so after insertion we evict samples until \(\sum_{i=1}^{t} |\mathcal{B}_i| \le \beta\).


In practice, a simple and effective policy is to maintain approximate parity across partitions by setting
\(|\mathcal{B}_i| \approx \beta/t\).
Importantly, eviction is performed within each partition rather than by comparing ranks across datasets: any seen partition \mbox{\(\mathcal{B}_i\)} (i.e., \mbox{\(i\leq t\)}) is trimmed whenever its size exceeds $\beta/t$. To preserve the fixed balance between representative and difficult samples, eviction is applied category-wise by removing the lowest-ranked samples within each subset of \mbox{\(\mathcal{B}_i\)}. Consequently, each seen dataset retains a reserved share of the buffer by construction, and a dataset cannot be eliminated due to cross-cohort differences in the scale or distribution of \mbox{\(R_{\text{rep}}\)} and \mbox{\(R_{\text{diff}}\)}.
This prevents domination by larger datasets and preserves intra-dataset diversity.
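
For intuition, the parity rule can be traced with hypothetical numbers (a capacity of $\beta = 120$ is assumed here only for illustration). After session $t = 3$, each partition holds
\[
|\mathcal{B}_i| = \frac{\beta}{t} = \frac{120}{3} = 40 \quad \text{volumes, i.e., } 20 \text{ representative and } 20 \text{ difficult.}
\]
After session $t = 4$, the per-partition share drops to $120/4 = 30$, so each of the three older partitions evicts $10$ volumes (its $5$ lowest-ranked representative and $5$ lowest-ranked difficult samples), and the new partition $\mathcal{B}_4$ contributes $15 + 15 = 30$ volumes, giving $4 \cdot 30 = 120 = \beta$ in total. When $\beta/t$ is not an integer, some rounding convention is needed; rounding down is one simple choice that keeps the total within $\beta$.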




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Domain-conditioned Textual Guidance}
As illustrated in the DCTG block of Fig.~\ref{fig:proposedFramework}, global domain knowledge is injected into the inherently local U-Net bottleneck representation through a text-guided multi-head cross-attention module. For each training case, we compose a short textual description that specifies the lesion type and the set of available MRI modalities (see the example in Fig.~\ref{fig:proposedFramework}).
We adopt a prompt-based representation to avoid assuming a fixed, known-in-advance vocabulary of modalities and lesion descriptors: new modality subsets or lesion types can be expressed in text without redefining the conditioning dimensionality.
This text is encoded with a pretrained biomedical language model, BioBERT~\cite{lee2020biobert}, yielding contextual token embeddings $\mathcal{T} \in \mathbb{R}^{B \times N_t \times 768}$, where $B$ is the batch size and $N_t$ the number of text tokens.

We keep the text encoder frozen so that the mapping from a given (lesion type, modality subset) description to its embedding is consistent across continual sessions, providing a stable conditioning signal when samples are revisited under replay.
The text tokens are mapped to the visual feature dimension, resulting in $\tilde{\mathcal{T}} \in \mathbb{R}^{B \times N_t \times d}$ with $d = 256$ and $N_t = 64$. In parallel, the U-Net bottleneck feature map $F \in \mathbb{R}^{B \times C \times H \times W \times D}$ with $C = 256$ is reshaped into a sequence of image tokens $X \in \mathbb{R}^{B \times N_i \times C}$, where $N_i = HWD$. These tokens are linearly projected to $d$ channels, added to a learned positional embedding $P \in \mathbb{R}^{1 \times N_i \times d}$, and normalized, producing $\tilde{X} \in \mathbb{R}^{B \times N_i \times d}$, which serves as the query sequence.

Multi-head cross-attention is then applied with queries derived from the image tokens and keys/values derived from the text tokens: $Q = \tilde{X} W_q$, $K = \tilde{\mathcal{T}} W_k$, and $V = \tilde{\mathcal{T}} W_v$, where each projection is factorized into $h = 8$ heads of dimension $d/h$. Scaled dot-product attention is computed independently in each head, and the head outputs are concatenated, yielding text-conditioned image embeddings $Y \in \mathbb{R}^{B \times N_i \times d}$. A linear projection maps $Y$ back to the original bottleneck channel dimension, after which a residual connection with the original bottleneck tokens and a final layer normalization are applied. The sequence is then reshaped to the original spatial layout, producing the refined bottleneck tensor $F_{\text{DCTG}} \in \mathbb{R}^{B \times C \times H \times W \times D}$. This design enables the bottleneck features to be modulated by cohort-level priors encoded in the textual prompt, such as the expected lesion category and modality configuration, so that the decoder receives locally detailed yet globally informed representations, improving robustness under heterogeneous MRI acquisition protocols.
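
Written out for a single sample (batch dimension omitted), the attention step described above takes the standard scaled dot-product form. Let $Q_j \in \mathbb{R}^{N_i \times d/h}$ and $K_j, V_j \in \mathbb{R}^{N_t \times d/h}$ denote the slices of $Q$, $K$, and $V$ belonging to head $j$; the subscripted symbols, $A_j$, and the output projection $W_o \in \mathbb{R}^{d \times C}$ are notation introduced here for illustration only:
\[
A_j = \operatorname{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d/h}}\right) \in \mathbb{R}^{N_i \times N_t},
\qquad
Y = \bigl[\,A_1 V_1 \;\Vert\; \cdots \;\Vert\; A_h V_h\,\bigr] \in \mathbb{R}^{N_i \times d},
\]
where the softmax is taken row-wise over the text tokens and $\Vert$ denotes concatenation along the feature dimension. The refined token sequence is then $\mathrm{LN}\bigl(X + Y W_o\bigr)$, with $\mathrm{LN}$ denoting layer normalization. With $d = 256$ and $h = 8$, each head operates on $d/h = 32$ dimensions with a scaling factor of $1/\sqrt{32} \approx 0.177$, and each of the $N_i$ image tokens distributes its attention over the $N_t = 64$ text tokens.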


\subsection{Modality-flexible Segmentation}\label{sec:dynamic}
Recent works~\cite{xu2024feasibility,sadegheih2025modality} that use a single U-Net model across multiple brain MRI datasets with heterogeneous modality sets assume a fixed maximum number of input channels and represent unavailable modalities with zero-filled placeholders. Although simple, this rigid design limits generalization, since a hospital may acquire a novel modality that is not part of the fixed set. A clinically deployable model must accommodate such variability without the need to predefine modality layouts. To support arbitrary and evolving modality sets, we equip CLMU-Net with a modality-flexible input layer via channel inflation (Fig.~\ref{fig:dynamicArch}, middle). At episode \(t\), the input convolution expands to match the maximum number of modalities observed so far, replacing the original \(K_{\max}(t-1)\)-channel layer with a \(K_{\max}(t)\)-channel layer.
Newly added channels are zero-initialized, and the pretrained weights are copied into the first \(K_{\max}(t-1)\) channels, allowing seamless continuation of learned representations. Each sample is then mapped to a \(K_{\max}(t)\)-channel tensor by inserting zero-valued channels for any absent modality.
The computational cost along the CL trajectory is upper-bounded by that of a model whose input configuration is fixed to \mbox{\(K_{\max}(T)\)} from the start. Thus, for any episode where \mbox{\(K_{\max}(t) < K_{\max}(T)\)}, channel inflation yields marginally lower computation.
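
The inflation step can be written explicitly under the assumption of a 3D input convolution with $C_{\text{out}}$ output channels and kernel size $k$ (both symbols are introduced here for illustration), whose weights before inflation are $W^{(t-1)} \in \mathbb{R}^{C_{\text{out}} \times K_{\max}(t-1) \times k \times k \times k}$. Indexing input channels by $c$, the inflated weights are
\[
W^{(t)}_{:,c} =
\begin{cases}
W^{(t-1)}_{:,c}, & 1 \le c \le K_{\max}(t-1),\\[2pt]
0, & K_{\max}(t-1) < c \le K_{\max}(t).
\end{cases}
\]
Because a convolution sums the contributions of its input channels, the first-layer response immediately after inflation satisfies
\[
\sum_{c=1}^{K_{\max}(t)} W^{(t)}_{:,c} * x_c \;=\; \sum_{c=1}^{K_{\max}(t-1)} W^{(t-1)}_{:,c} * x_c ,
\]
so the features computed for previously seen modalities are unchanged at the moment of expansion, and the new channels only begin to contribute once training on episode $t$ updates their weights.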



Further, to improve generalization and reduce spurious correlations between datasets and specific sequences, we adopt RMD~\cite{xu2024feasibility,sadegheih2025modality} (Fig.~\ref{fig:dynamicArch}, bottom). During training, available modalities are randomly masked for both current and replay samples, exposing the model to diverse modality combinations and encouraging redundancy-aware, modality-agnostic features. Together, channel inflation and RMD allow CLMU-Net to operate reliably under arbitrary, incomplete, or newly introduced modality configurations, enabling CL across evolving clinical protocols.
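
One simple way to formalize RMD, given here as a sketch under stated assumptions rather than as the exact sampling scheme, uses an availability indicator $a \in \{0,1\}^{K_{\max}(t)}$ for each sample and an independent random keep mask $m$. The drop probability $p$, the independence across modalities, and the rule of resampling when every available modality would be dropped are all assumptions of this sketch:
\[
m_c \sim \mathrm{Bernoulli}(1-p), \qquad \tilde{x}_c = a_c\, m_c\, x_c, \qquad c = 1, \dots, K_{\max}(t),
\]
with $m$ redrawn whenever $\sum_{c} a_c m_c = 0$. For example, a sample with $a = (1, 1, 0, 1)$ and a drawn mask $m = (1, 0, 1, 1)$ is presented to the network with only its first and fourth modalities, while the second and third channels are zero-filled.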

