\section{Methodology}
\label{sec:methodology}

\subsection{Overview}

%ICL-NoiseUNet is a U-Net–shaped architecture that integrates two complementary  mechanisms: 
%(i) a Noise Modulation Block (NMB), which modulates intermediate feature activations using analytic noise descriptors derived from the target image, and
%(ii) In-Context Feature Conditioning module (ICFC), which fuses information from a small context set to guide the segmentation of the target image.
%The NMB refines the target feature representation by suppressing speckle and identifying high-contrast regions, thus improving anatomical boundary delineation. On the other hand, the ICFC module leverages the context set to guide the model towards expected anatomical structures. 
%Thus, at each stage, the feature maps are simultaneously context-guided and ultrasound-aware, leading to robust segmentation in challenging ultrasound conditions.

ICL-NoiseUNet follows a U-Net–shaped \cite{Ronneberger.2015} design that consists of four encoder blocks, one bottleneck block, four decoder blocks and a segmentation head. Each of these blocks processes the features of the target image through three sequential modules:
\begin{enumerate}
\item \textbf{Feature Extraction Module:} Two consecutive Convolutional ($3\times3$)–Batch Normalization–ReLU layers extract feature maps $F_k^t$ at block $k$ for the target input $x_t$.
\item \textbf{Noise Modulation Block (NMB):} Residual and variance noise maps are combined at block $k$ to form a modulation factor $M_k$, which adapts the features $F_k^t$ to ultrasound characteristics. More specifically, the NMB generates modulated feature representations $F_k^{t,\text{mod}}$ by suppressing speckle and identifying high-contrast regions (details in Subsection~\ref{method:nmb}).
\item \textbf{In-Context Feature Conditioning (ICFC):} For the context images, features $F_k^i$ are extracted in parallel at each block $k$ by a separate U-Net encoder–decoder backbone (without NMB or ICFC) that shares its weights across all context images $x_i$. The modulated target features $F_k^{t,\text{mod}}$ are then fused with the context features $F_k^i$ via channel concatenation, followed by a shared $1 \times 1$ convolution and average pooling, to obtain context-informed representations $\hat{F}_k^t$ (see Subsection~\ref{method:icfc}). In Figure~\ref{fig:overall_pipeline}, this step is labeled as the Target-Context Fusion Block. In this way, the context set guides the model towards the expected anatomical structures. This target-context fusion mechanism draws inspiration from prior in-context learning designs such as Neuralizer \cite{neuralizer}.
\end{enumerate}

At the end of the bottleneck and decoder blocks, transposed convolutional layers are applied to restore spatial resolution, and skip connections link the encoder and decoder stages. Finally, the output mask $\hat{y}_t$ is generated by the segmentation head, which consists of a $1 \times 1$ convolution layer followed by a sigmoid activation function. An overview of the complete architecture is presented in Figure~\ref{fig:overall_pipeline}.

The main role of the NMB is to reduce the effect of ultrasound speckle on feature activations. It refines feature representations using residual and variance noise maps and acts as a noise-aware normalization mechanism across the network. It thereby reduces noise-related variations and makes features that represent similar anatomical structures more comparable, even when noise patterns or scanner settings vary. As a result, context conditioning operates in a more semantically meaningful feature space, allowing contextual information to guide segmentation more effectively. Without the NMB, noise variability can propagate through the network and weaken context alignment. The NMB therefore complements rather than replaces contextual guidance. Additionally, unlike prior in-context learning approaches that rely solely on a sequence of target–context fusion blocks, we introduce multi-stage feature extraction before the fusion stage. This design helps the model capture richer feature representations before the target and context features interact.
%Let $x_t \in \mathbb{R}^{H \times W}$ denote the target image and 
%$C = \{x_i, y_i\}_{i=1}^L$ the context set. 
%The model learns the mapping
%\begin{equation}
%f_\theta(x_t, C) \rightarrow \hat{y}_t,
%\end{equation}
\input{figures/methodology/Architecture}


%\paragraph{Feature Extractor at Each Encoder–Decoder Stage}
%Our architecture draws inspiration from prior ICL-based segmentation designs such as Neuralizer \cite{neuralizer}, especially for the ICFC block. Unlike all previous in-context learning approaches that rely solely on a sequence of target–context fusion blocks, we introduce multi-stage feature extraction before the fusion stage. More specifically, at every encoder–decoder stage, the target representation is first processed by two $3\times3$ Conv–BatchNorm–ReLU modules for feature extraction, producing feature maps stated as $F_k^t$. These maps are then passed to the subsequent Noise Modulation Block (NMB), described in the following paragraph. To obtain meaningful contextual representations, each context image $i$ is processed through the same encoder–decoder architecture of $3\times3$ Conv–BatchNorm–ReLU blocks to generate context feature maps $F_k^i$. The encoder–decoder architecture shares its weights across the context set, but for the input image, we use separate weights. This design helps the model to capture richer, more informative feature representations that interact effectively, increasing its capabilities.

\subsection{Noise Modulation Block for Ultrasound Robustness}
\label{method:nmb}
%Ultrasound data are corrupted by spatially varying \textit{speckle noise}, which can blur anatomical boundaries.  
To enhance the quality of ultrasound segmentation, two complementary analytic descriptors are utilized: (i) a residual noise map, computed as the difference between the image and its Gaussian-smoothed version, and (ii) a local variance map, which measures how much the intensity fluctuates within a small neighborhood. The variance map highlights regions dominated by speckle, where feature activations should be suppressed. The residual noise map, on the other hand, captures details, such as anatomical edges, that should be preserved. Combining both provides a balanced representation of ultrasound patterns.
We therefore integrate an \textbf{NMB} at each encoder and decoder level, as shown in Figure~\ref{fig:overall_pipeline}. The first descriptor is the residual noise map
\begin{equation}
    n_r(x_t) = |x_t - (G_\sigma * x_t)| ,
\end{equation}
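where $G_\sigma$ denotes a Gaussian smoothing kernel and $*$ denotes convolution. The second descriptor is a local variance map, which identifies regions with high variability; here, $\mathbb{E}[\cdot]$ denotes the mean over a small neighborhood of each pixel:
\begin{equation}
    n_v(x_t) = \mathbb{E}[x_t^2] - (\mathbb{E}[x_t])^2 .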
\end{equation}
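To make the local statistics explicit, both descriptors can be written per pixel. As an illustrative assumption (the symbols $\Omega_w(p)$, $w$ and $\mu_w$ are introduced here for exposition only), let $\Omega_w(p)$ denote a $w \times w$ window centered at pixel $p$, and let the local mean be the uniform average over this window. Then
\begin{equation}
    \mu_w(p) = \frac{1}{w^2} \sum_{q \in \Omega_w(p)} x_t(q), \qquad
    n_v(x_t)(p) = \frac{1}{w^2} \sum_{q \in \Omega_w(p)} x_t(q)^2 - \mu_w(p)^2 .
\end{equation}
Under this reading, $n_v(x_t)(p) = \frac{1}{w^2} \sum_{q \in \Omega_w(p)} \big(x_t(q) - \mu_w(p)\big)^2 \geq 0$, so the variance map vanishes where the image is locally constant and grows with local intensity fluctuation. Likewise, assuming a normalized kernel with $\sum_q G_\sigma(q) = 1$, the smoothed image is $(G_\sigma * x_t)(p) = \sum_q G_\sigma(p - q)\, x_t(q)$, so the residual map is zero wherever $x_t$ is constant over the kernel support.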
These maps are precomputed before the forward pass for each image during training and inference. At each encoder–decoder level, they are resized and combined to form the modulation factor $M_k$ with the following equation:
\begin{equation}
M_k = 1 + \alpha_k n_r(x_t) + \beta_k n_v(x_t),
\end{equation}
where $\alpha_k$ and $\beta_k$ are learnable scalar parameters. The feature map $F_k^t$ is then refined as
\begin{equation}
F_k^{t,\text{mod}} = F_k^t \odot M_k.
\end{equation}
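Written per channel $c$ and spatial position $p$, the modulation can be sketched as follows. We assume here, as an interpretive reading, that $n_r^{(k)}$ and $n_v^{(k)}$ denote the two maps resized to the spatial resolution of $F_k^t$ (notation introduced only for this illustration) and that the single-channel factor $M_k$ is broadcast across channels:
\begin{equation}
F_k^{t,\text{mod}}[c,p] = F_k^t[c,p] \left( 1 + \alpha_k\, n_r^{(k)}(p) + \beta_k\, n_v^{(k)}(p) \right).
\end{equation}
For $\alpha_k = \beta_k = 0$, the factor is $M_k = 1$ and the block reduces to the identity. As a purely hypothetical numerical example (these are not learned values reported in this work), $\alpha_k = 0.5$, $\beta_k = -0.3$, $n_r^{(k)}(p) = 0.2$ and $n_v^{(k)}(p) = 0.4$ give $M_k(p) = 1 + 0.10 - 0.12 = 0.98$, i.e., a slight attenuation of the activation at $p$.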
The process is depicted in Figure~\ref{fig:noise_block}. In this way, the target feature representations become less affected by speckle, which makes their interaction with the context features in the ICFC module more reliable.
\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\linewidth]{images/methodology/3.jpg}
    \caption{Noise Modulation Block (NMB).  
    Residual and variance maps modulate activations through learnable weights, improving boundary preservation under speckle noise.}
    \label{fig:noise_block}
\end{figure}


\subsection{In-Context Feature Conditioning (ICFC) Module}
\label{method:icfc}
Context images are encoded by the separate, weight-shared backbone, which generates the context feature maps $F_k^i$. To fuse information from the context feature maps, we concatenate each of them channel-wise with the noise-modulated feature map of the target image:
\begin{equation}
f_k^{c,i} = [F_k^{t,\text{mod}} \Vert F_k^i].
\end{equation}
The concatenated features $f_k^{c,i}$ are then passed through a shared $1 \times 1$ convolution $\phi_k$:
\begin{equation}
\tilde{f}_k^{c,i} = \phi_k(f_k^{c,i}).
\end{equation}
Subsequently, the pairwise contextual features $\tilde{f}^{c,i}_k$ are aggregated via mean pooling over the context set. Finally, a residual refinement with the noise-modulated target features $F_k^{t,\text{mod}}$ is applied to update the target feature map:
\begin{equation}
\hat{F}_k^t = \text{GeLU}\big(F_k^{t,\text{mod}} + \text{Mean}_i(\tilde{f}^{c,i}_k)\big).
\end{equation}
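For clarity, the tensor shapes involved can be tracked under the following assumptions, which are introduced here for illustration only: target and context features at block $k$ both have $C_k$ channels and spatial size $H_k \times W_k$, and the context set contains $L$ images. The concatenation then yields $f_k^{c,i} \in \mathbb{R}^{2C_k \times H_k \times W_k}$, the shared convolution $\phi_k$ maps it back to $\tilde{f}_k^{c,i} \in \mathbb{R}^{C_k \times H_k \times W_k}$, and the aggregation is
\begin{equation}
\text{Mean}_i\big(\tilde{f}_k^{c,i}\big) = \frac{1}{L} \sum_{i=1}^{L} \tilde{f}_k^{c,i}.
\end{equation}
Mapping back to $C_k$ channels is what makes the residual sum with $F_k^{t,\text{mod}}$ well defined, and averaging over $i$ makes the fused representation independent of the ordering of the context images.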
The target-context fusion mechanism is illustrated in Figure~\ref{fig:context_fusion}.
\begin{figure}[H]
    \centering
    \includegraphics[width=0.95\linewidth]{images/methodology/Screenshot_4.jpg}
    \caption{Target-Context fusion block.  
    The target feature map (blue) is concatenated with each context feature map (green). }
    %Then, they are passed through a shared  $1\times1$ convolution layer. Finally, the output features are aggregated via mean pooling and added back to the target feature map.}
    \label{fig:context_fusion}
\end{figure}

%----------------------------------------------------------


%----------------------------------------------------------
%\paragraph{Segmentation Head}

%The decoder mirrors the encoder using transposed convolutions and skip connections to capture low- and high-level features. At each 
%decoder level, we apply the NMB to the upsampled features and then the ICFC block for target-context fusion. Finally, a $1\times1$ convolution is applied, followed by a sigmoid activation that generates the segmentation mask $\hat{y}_t$.
\begin{comment}
\begin{equation}
    \hat{y}_t = \sigma(W h^t_0 + b),
\end{equation}
where $W$ and $b$ denote the parameters of the segmentation head.
\end{comment}
\begin{comment}
\subsection{Training and Inference}
ICL-NoiseUNet was trained using the PyTorch Lightning framework, which supports distributed training, checkpointing and early stopping. Each training batch consists of one target image and mask, accompanied by a fixed number of $L = 4$ context examples sampled randomly. The context size $L$ was selected based on an ablation analysis (see Section~\ref{sec:context_ablation}). In particular, smaller context sizes ($L < 4$) reduce representational variety and weaken in-context conditioning, while larger sets ($L > 4$) offer no gains and may introduce less relevant examples. This happens because support images deviate significantly in L2 distance to the target image. Furthermore, increasing $L$ beyond 4 substantially increases GPU memory consumption and training time without improving generalization. In addition, the window size for the residual and local variance map estimation is set to $7$. This choice provides the best balance between capturing sufficient local context and preserving structural details (see ~\ref{appendix:nmb_ablation} for further information).
%The context size was selected as a practical balance between accuracy, efficiency, and computational cost. Using fewer $4$ examples as a support set reduced contextual diversity, whereas using more than $4$ does not yield any performance gains but requires substantially more GPU memory, further increasing the computational burden of an already large model.
Data augmentation is applied only to training targets, while context samples remain unchanged. The augmentation strategy includes horizontal and vertical flips, rotations, elastic deformations, zooming, Gaussian noise addition and brightness–contrast adjustment. With this selective augmentation, we preserve the integrity of the structural priors within the training set. Moreover, for each dataset we follow the $70-15-15$ split for training, validation, and testing. Training is performed for $50$ epochs with a batch size of 4, using the AdamW optimizer with a learning rate of $1\times 10^{-5}$ and weight decay of $1\times 10^{-7}$. An NVIDIA GeForce GTX $1080$ GPU is utilized for training and inference. For improved speed, we have also included a distributed data-parallel mode in our code. Model checkpoints are saved according to validation loss and early stopping is triggered when no improvement occurs for $8$ consecutive epochs.
\paragraph{Inference}
During inference, the $4$ context examples are chosen based on the lowest $L_2$ distance between the target and samples in the context pool. This ensures that the images from the support set are visually similar and relevant. These context images are passed through the encoder to extract contextual features, which are then fused with the target features using the In-Context Feature Conditioning (ICFC) module. This setup allows the model to adapt its segmentation to the anatomical structures and noise characteristics represented in the support set. We use $L_2$ for two main reasons: (1) it is computationally efficient, avoiding the higher cost of perceptual metrics such as SSIM \cite{SSIM} or LPIPS \cite{LPIPS}; and (2) it is deterministic and parameter-free, unlike perceptual metrics that require additional hyperparameters or pretrained models.  
The network produces a probability map, which is converted to a binary mask using a threshold of $0.5$. We report quantitative results using standard metrics, such as Dice coefficient, Intersection-over-Union (IoU), Precision and Recall.

\end{comment}



%--------------------------------------------- 

%--------------------------------------------- 
\begin{comment}
\subsection{Novelty and Relevance to Ultrasound Imaging}  

Our model introduces a unified perspective for ultrasound segmentation by embedding contextual reasoning and noise adaptation directly within the UNet feature hierarchy.  In other words, ICL-NoiseUNet performs \emph{feature-level in-context learning} and \emph{analytic noise-aware modulation} as intrinsic components of the network.  

This innovation is particularly impactful for ultrasound imaging because:
\begin{itemize} 
    \item \textbf{Contextual reasoning}: It mimics clinical reasoning by leveraging a few reference cases to guide segmentation of new scans.  
    \item \textbf{Noise-aware learning}: Explicitly captures speckle noise, leading to robustness in the segmentation mask outputs;  
    \item \textbf{Hierarchical integration}: Indicates that contextual adaptation and noise map guidance can be jointly applied in broader model architectures.
\end{itemize}  

Thus,  a \emph{task-agnostic, noise-aware segmentation framework} is created, which constitutes a significant step towards generalizable application of contextual information in ultrasound analysis.  



\paragraph{Summary}
The proposed ICL-NoiseUNet combines three synergistic components:
\begin{enumerate}
    \item An \textbf{in-context feature conditioning mechanism} that dynamically integrates priors from few-shot reference examples.
    \item A set of \textbf{Noise Modulation Blocks} that adapt activations to local speckle noise statistics.
    \item A unified \textbf{encoder–decoder framework} that performs multi-scale fusion of context and target information for precise segmentation.
\end{enumerate}
This end-to-end design jointly models contextual reasoning and noise adaptation, yielding robust, anatomically consistent predictions in ultrasound domains.

\end{comment}
%\newpage