\section{Experiments and Results}
We design our experiments to test whether the two key ideas of ICL-NoiseUNet, feature refinement by the NMB and conditioning of the segmentation model on a context set, jointly improve ultrasound segmentation. Given the unique properties of the ultrasound modality, we hypothesize that the NMB stabilizes early representations, while context examples guide the model towards plausible anatomical priors. We therefore evaluate: (i) the contribution of each component to the model's capability (ablations), (ii) the model's performance against state-of-the-art models, (iii) the effect of context size and (iv) cross-dataset robustness within the same task. Specifically, we test \textbf{ICL-NoiseUNet} on $4$ ultrasound segmentation tasks: fetal head, breast lesion, thyroid gland and cardiac chamber. Details of the publicly available datasets used, RADBOUD \cite{vandenHeuvel2018}, JNU-IFM \cite{Lu.2022}, BUSI \cite{AlDhabyani.2020}, BUS-BRA \cite{WilfridoGomezFlores.2023}, Magdeburg Thyroid \cite{Wunderling2017} and TG3K \cite{gong2022thyroid}, are given in Appendix~\ref{appendix:datasets}. For all datasets, splits are performed at the patient level, so that no patient appears in more than one split and highly similar frames cannot leak across the training, validation and test sets. Context is retrieved only from the training data, whose patient identities are completely separate from those used for validation and testing. For the video-based datasets (JNU-IFM \cite{Lu.2022} and CAMUS \cite{Leclerc.2019}), all frames from a given video are assigned to a single split. For BUSI \cite{AlDhabyani.2020} and BUS-BRA \cite{WilfridoGomezFlores.2023}, patient-level splits are additionally stratified by pathology; for BUS-BRA, we also report results stratified by scanner to support the noise/statistics adaptation analysis in Appendix~\ref{appendix:robustness}.
We train with PyTorch Lightning for $50$ epochs using the AdamW optimizer (learning rate $1\times10^{-5}$). The window size for the residual and variance maps is set to $7$ (see Appendix~\ref{appendix:nmb_ablation} for further analysis). During training, context samples are drawn at random from the training set with a fixed context size of $L=4$ (this choice is justified in Section~\ref{sec:context_ablation}). At inference time, the $4$ context images are the training images with the lowest $L_2$ distance to the target image. We use the $L_2$ distance because it is fast to compute, deterministic and parameter-free, unlike perceptual metrics such as SSIM~\cite{SSIM} or LPIPS~\cite{LPIPS}. Full implementation details, augmentation transformations, data splits and hardware specifications are provided in Appendix~\ref{appendix:implementation}.
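
The inference-time retrieval rule can be written compactly as follows. This is only a sketch: the symbols $x$ (target image), $\mathcal{D}_{\mathrm{train}}$ (pool of training images), $x_i$ and $\mathcal{C}(x)$ (retrieved context set) are notation introduced here for illustration, and the sketch assumes that the distance is computed directly on the pixel intensities of equally sized images.
\[
\mathcal{C}(x) \;=\; \mathop{\arg\min}_{S \subset \mathcal{D}_{\mathrm{train}},\; |S| = L} \;\; \sum_{x_i \in S} \| x - x_i \|_2 , \qquad L = 4 .
\]
Because each term of the sum depends on a single training image, the minimizer is simply the set of the $L$ training images closest to the target, which is why the rule is deterministic and needs no additional parameters.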

%\subsection{Implementation Details}
%ICL-NoiseUNet was implemented using the PyTorch Lightning framework. Each training batch consists of one target image and mask. The context set consists of a number of images that belong to the \textbf{training set} and are selected with \textbf{random sampling}. Also, the context size is fixed at $L=4$ for all experiments, and an ablation study is conducted in Section~\ref{sec:context_ablation} Based on our analysis in Appendix ~\ref {appendix:nmb_ablation}, the window size for the residual and local variance map estimation for our Noise Modulation Block is set to $7$. 
%Data augmentation is applied only to training targets, while context samples remain unchanged. The augmentation strategy includes horizontal and vertical flips, rotations, elastic deformations, zooming, Gaussian noise addition and brightness–contrast adjustment. Moreover, for each dataset, we follow the $70-15-15$ split for training, validation, and testing. Additionally, training is performed for $50$ epochs with a batch size of 4, using the AdamW optimizer with a learning rate of $1\times 10^{-5}$ and weight decay of $1\times 10^{-7}$. An NVIDIA GeForce GTX $1080$ GPU is utilized for training and inference. For improved speed, we have also included a distributed data-parallel mode in our code. Model checkpoints are saved according to validation loss and early stopping is triggered when no improvement occurs for $8$ consecutive epochs. \\
%During inference, the $4$ context examples are chosen based on the lowest $L_2$ distance between the target and samples in the context pool. This ensures that the images from the support set are visually similar and relevant. We use $L_2$ for two main reasons: (1) it is computationally efficient, avoiding the higher cost of perceptual metrics such as SSIM \cite{SSIM} or LPIPS \cite{LPIPS}; and (2) it is deterministic and parameter-free, unlike perceptual metrics that require additional hyperparameters or pretrained models. The network produces a probability map, which is converted to a binary mask using a threshold of $0.5$. We report quantitative results using standard metrics, such as Dice score and Intersection-over-Union (IoU).


\subsection{Ablation Study for Context Conditioning and Noise Modulation Block}
Table~\ref{tab:busi_busbra} presents ablation experiments on the BUSI \cite{AlDhabyani.2020}, BUS-BRA \cite{WilfridoGomezFlores.2023} and RADBOUD \cite{vandenHeuvel2018} datasets, which isolate the contributions of contextual conditioning and the NMB. Removing either component reduces Dice and IoU, indicating that the two mechanisms provide complementary benefits. On BUSI \cite{AlDhabyani.2020}, Dice improves from 0.74 to 0.80 when both modules are active, while on BUS-BRA \cite{WilfridoGomezFlores.2023} the full ICL-NoiseUNet achieves 0.93 Dice and 0.87 IoU, an improvement of over 3\% compared to the single-component variants. Similarly, on RADBOUD \cite{vandenHeuvel2018}, our model achieves a Dice of 0.96, a significant gain in segmentation stability over the reduced variants.\\
All differences relative to the full model are statistically significant, which supports our hypothesis. A detailed analysis of the complementary effect of the variance and residual noise maps that form the NMB can be found in Appendix~\ref{appendix:nmb_ablation}. We also evaluate the synergy between context and noise maps when the target and context share encoder--decoder weights (see Appendix~\ref{appendix:shared_weights}), which changes performance only minimally.
Finally, Appendix~\ref{appendix:modulation_parameters} analyzes the learned modulation parameters ($\alpha_k$ and $\beta_k$) across network depth and datasets, offering further insight into how the model adapts its noise handling to different tasks.
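
For reference, the Dice and IoU values reported in these ablations can be read against the standard overlap definitions. The following is a sketch, assuming these standard definitions are the ones used; $P$ (set of predicted foreground pixels) and $G$ (set of ground-truth foreground pixels) are symbols introduced here for illustration.
\[
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad
\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|} .
\]
For a single image the two metrics are monotonically related through $\mathrm{IoU} = \mathrm{Dice} / (2 - \mathrm{Dice})$; averages over a test set need not satisfy this relation exactly.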
 \input{tables/breast_Context_Unet}
\paragraph{Sensitivity Analysis of the Noise Modulation Block}
To evaluate the contribution of each of the two components of the NMB, we perform an ablation study on the CAMUS \cite{Leclerc.2019} and BUS-BRA \cite{WilfridoGomezFlores.2023} datasets.
We compare four variants: (1) the full NMB (residual + variance maps), (2) residual map only, (3) variance map only, and (4) no NMB. As shown in Table~\ref{tab:nmb_ablation}, the full design that combines the residual and variance maps achieves the strongest results, whereas the single-component variants perform consistently worse and removing the entire NMB leads to the largest performance drop. All differences relative to the full model are statistically significant, which supports the complementary effect of the two maps. This is further supported by the qualitative examples in Appendix~\ref{appendix:nmb_ablation}: the residual-only variant preserves sharp boundaries but often misses low-contrast or blurred regions, resulting in more false negatives, whereas the variance-only variant suppresses speckle but frequently overextends into surrounding tissue, leading to more false positives. By combining both descriptors, the full block yields more anatomically consistent segmentations.
\paragraph{Feature Representation Analysis}
To further analyze the impact of the NMB, we visualize bottleneck feature representations using t-SNE on 100 BUS-BRA \cite{WilfridoGomezFlores.2023} samples (33 benign, 67 malignant) (Fig.~\ref{fig:feature_level}, Appendix~\ref{appendix:representation}). Without the NMB, benign and malignant features are poorly separated, indicating weak class awareness in the latent space. With the NMB, the clusters are more compact and class-aligned, with substantially less overlap, which is also reflected in a much higher silhouette score (0.26 vs.\ 0.05). Although t-SNE is primarily a visualization tool, the improved separability suggests that the NMB promotes more structured and discriminative feature representations, which helps explain the observed gains in segmentation performance.
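
The silhouette scores quoted above can be interpreted through the standard definition of the measure. The following is a sketch under that assumption; the symbols $a(i)$, $b(i)$, $s(i)$, $S$ and $N$ are introduced here for illustration, and the feature space and distance in which the score is computed are not specified in this section. For sample $i$, let $a(i)$ be its mean distance to the other samples of its own class and $b(i)$ its mean distance to the samples of the other class. Then
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{N} \sum_{i=1}^{N} s(i), \qquad N = 100 ,
\]
so that $S \in [-1, 1]$. Values near $0$ correspond to overlapping classes, while larger positive values correspond to more compact and better separated clusters.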
\input{tables/ablation_nmb}
\subsection{Comparison with SOTA Models}
For a fair comparison, all evaluated models are trained under an identical experimental setup: the same dataset splits, data augmentations, and training and inference settings. For datasets that contain multiple classes, we ensure that each split preserves the class proportions of the full dataset. SAM-family models are prompted with positive points. For each test image, we conduct $5$ independent runs, each using $5$ positive point prompts sampled from within the ground-truth foreground mask: $4$ points close to the spatial extremes ($x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$ of the mask bounding region) to provide boundary coverage, plus $1$ additional point sampled randomly from the interior region. This strategy gives SAM good spatial coverage while introducing variability across runs through the randomly sampled points.
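
One way to make the prompt construction precise is sketched below. The notation is introduced here for illustration only: $M$ denotes the set of foreground pixel coordinates $(u, v)$ of the ground-truth mask and $\mathcal{U}(M)$ uniform sampling over it. The sketch also simplifies the procedure by placing the four boundary prompts exactly at the extremes, whereas the actual sampling only places them close to these locations.
\[
p_1 \in \arg\min_{(u,v) \in M} u, \qquad p_2 \in \arg\max_{(u,v) \in M} u, \qquad
p_3 \in \arg\min_{(u,v) \in M} v, \qquad p_4 \in \arg\max_{(u,v) \in M} v,
\]
\[
p_5 \sim \mathcal{U}(M) .
\]
Under this simplification, the variability between the $5$ runs comes from the random prompt $p_5$ and from tie-breaking among equally extreme pixels.
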
Table~\ref{tab:general_sota} compares ICL-NoiseUNet with CNN-based and foundation models on the BUS-BRA \cite{WilfridoGomezFlores.2023} dataset. The proposed model achieves a Dice score of $93\%$, outperforming UltraSAM \cite{Meyer_2025} by $8\%$ and MedSAM2 \cite{ma2025medsam2segment3dmedical} by $13\%$. These results highlight the benefits of combining contextual conditioning with analytic noise modulation. To further assess our model's robustness, we additionally evaluate it on the CAMUS \cite{Leclerc.2019}, BUSI \cite{AlDhabyani.2020} and RADBOUD \cite{vandenHeuvel2018} datasets. As reported in Table~\ref{tab:general_sota}, our method reaches Dice scores of $0.94$ on CAMUS, $0.80$ on BUSI and $0.96$ on RADBOUD, outperforming a range of competitors, including SwinUNet \cite{cao2021swinunetunetlikepuretransformer}, nnU-Net \cite{isensee2018nnunetselfadaptingframeworkunetbased}, MedSAM2 \cite{ma2025medsam2segment3dmedical} and the most recent in-context learning model, MultiverSeg \cite{wong2025multiversegscalableinteractivesegmentation}.
\input{tables/comparison_busi_camus_busbra}
%Moreover, Table~\ref{tab:radboud_comparison} presents results on the RADBOUD \cite{vandenHeuvel2018} dataset for the fetal head segmentation task. ICL-NoiseUNet achieves a Dice of 0.965 and IoU of 0.947, outperforming U-Net, U-Net++, and W-Net baselines with lower variance.  In summary, ICL-NoiseUNet demonstrates superior performance, resulting in an effective reduction in false positives. In Figure~\ref{fig:camus_comp}, we observe that our method generates smoother, more accurate boundaries than other models. More specifically, the baseline methods miss key anatomical details and create irregular edges. Thus, ICL-NoiseUNet achieves better overall segmentation results.
%\input{tables/comparison_busi_camus_busbra}

\input{figures/comparison}
We conducted further experiments comparing against an established few-shot segmentation technique (PANet \cite{panetfewshotimagesemantic}) and feature-wise conditioning mechanisms applied to our noise maps (FiLM \cite{filmvisualreasoninggeneral} and Conditional BatchNorm \cite{cbn}). ICL-NoiseUNet consistently outperforms all of these baselines on both CAMUS \cite{Leclerc.2019} and scanner-stratified BUS-BRA \cite{WilfridoGomezFlores.2023} (detailed results in Appendix~\ref{appendix:few_shot}). The results indicate that PANet struggles in ultrasound segmentation and that generic feature-wise conditioning methods are less effective than our NMB design.
\subsection{Effect of Context Size on Segmentation Performance}
\label{sec:context_ablation}

To assess the impact of context size on our model's performance, we train ICL-NoiseUNet with context sizes $L \in \{1, 2, 4, 8, 16\}$ and measure Dice scores on the BUS-BRA \cite{WilfridoGomezFlores.2023}, CAMUS \cite{Leclerc.2019} and RADBOUD \cite{vandenHeuvel2018} datasets. Figure~\ref{fig:context_ablation} shows the effect of context size on segmentation performance for each dataset (more detailed results for BUS-BRA \cite{WilfridoGomezFlores.2023} are reported in Appendix~\ref{appendix:context_ablation_table}). We observe a slight but consistent improvement as the context size increases from $L=1$ to $L=4$. Beyond $L=4$, performance stops improving and degrades noticeably, especially on BUS-BRA. A likely explanation is that larger context sets introduce redundant or less relevant examples whose distribution differs significantly from that of the target image, which reduces the effectiveness of context guidance. The same pattern is observed across all evaluated segmentation tasks, with $L=4$ achieving the best trade-off between contextual diversity and computational efficiency; we therefore select $L=4$ for all main experiments. In contrast to methods such as Neuralizer \cite{neuralizer} and UniverSeg \cite{butoi}, which rely solely on contextual information, our framework benefits from the synergy between the base encoder--decoder backbone, analytic noise descriptors and context conditioning. In addition, we evaluate our model's sensitivity to the context selection strategy by comparing retrieval based on $L_2$ distance, retrieval based on the SSIM index, and random selection. For random selection, we draw the context set from a 10\% subsample of the training set. The results in Appendix~\ref{appendix:context_sel} show minimal differences on BUS-BRA \cite{WilfridoGomezFlores.2023} (0.911 vs.\ 0.902 vs.\ 0.910, respectively) and CAMUS \cite{Leclerc.2019} (0.940 vs.\ 0.937 vs.\ 0.936, respectively).
These results indicate that the model is robust to the context selection method and does not require precise similarity matching. Moreover, the analysis in Appendix~\ref{appendix:context_inference} confirms the model's robustness to context size variations during inference.
\input{figures/context_size_ablation}
%\input{tables/comparison_busi_camus_BUS-BRA}
%\input{tables/busbra_SAM_SOTA}
%\input{tables/Radboud_comp_baselines}
%To further evaluate the robustness of ICL-NoiseUNet to variations in context size, we trained the model using a fixed context size of $L=4$ and subsequently tested it with different context sizes across datasets. Figure~\ref{fig:context_ablation_busb} shows the Dice and IoU distributions for the BUS-BRA \cite{WilfridoGomezFlores.2023} and CAMUS \cite{Leclerc.2019} datasets, respectively. The results demonstrate that segmentation performance remains stable across a wide range of context sizes, with only marginal fluctuations in both metrics. In other words, the model effectively captures contextual information during training, enabling reliable segmentation even when the available contextual information changes at inference time.
%\input{figures/context_inference}

\subsection{Cross-Domain Evaluation}
To assess domain generalization, Table~\ref{tab:transfer_results} reports cross-dataset results for fetal head and thyroid gland segmentation. We train the model on JNU-IFM \cite{Lu.2022} and Thyroid-Magdeburg \cite{Wunderling2017} and test it directly on RADBOUD \cite{vandenHeuvel2018} and TG3K \cite{gong2022thyroid}, respectively. ICL-NoiseUNet achieves Dice scores of $0.901$ and $0.921$ on the respective test datasets, outperforming baselines that were trained entirely on RADBOUD \cite{vandenHeuvel2018}. These results suggest that noise modulation improves the model's ability to generalize across datasets of the same task without retraining. Additionally, we compare ICL-NoiseUNet with MixStyle \cite{mixstyle}, a style-based feature augmentation method, which performs slightly worse in these cross-dataset settings (0.901 vs. 0.911; 0.921 vs. 0.911). This suggests that explicitly modeling ultrasound noise provides stronger generalization than style-based augmentation alone.

%\input{tables/Radboud_comp_baselines}
\begin{comment}
\subsection{Fetal Head Segmentation Results}
Table~\ref{tab:radboud_comparison} presents results on the RADBOUD \cite{vandenHeuvel2018} fetal head segmentation benchmark. ICL-NoiseUNet achieves a Dice of 0.965 and IoU of 0.947, outperforming U-Net, U-Net++, and W-Net baselines with lower variance and superior boundary adherence. This performance gain demonstrates that integrating contextual priors and noise descriptors strengthens anatomical consistency and segmentation precision, even in images degraded by high speckle levels.

\input{tables/Radboud_comp_baselines}
\subsection{Summary}
Overall, ICL-NoiseUNet demonstrates strong generalization, stability, and robustness across several evaluated datasets. The model’s ability to preserve anatomical detail and suppress speckle noise variations shows that it is a practical and interpretable architecture for ultrasound image segmentation.
\end{comment}
%\input{tables/comparison_busi_camus_busbra}
%\input{tables/busbra_SAM_SOTA}
%\input{tables/Radboud_comp_baselines}
\input{tables/cross_domain_thyroid_fetal}
\subsection{Limitations and Future Work}
%\textcolor{red}{ In many existing segmentation models, performance degrades when acquisition conditions change due to differences in imaging devices, populations, or other acquisition parameters. To address this, our context-based segmentation approach leverages a small set of context examples to enable the model to adapt at test time to new datasets without requiring retraining.}
Overall, ICL-NoiseUNet demonstrates strong generalization, stability, and robustness across the evaluated datasets. The model's ability to preserve anatomical detail and suppress speckle noise variations shows that it is a practical and interpretable architecture for ultrasound image segmentation.
Importantly, the proposed framework is not restricted to a U-Net \cite{Ronneberger.2015} backbone. The NMB and ICFC components are modular and can be integrated into other segmentation architectures. For transformer-based models, the NMB could be inserted after each transformer block to refine the output features with resolution-matched noise maps, while ICFC could be applied by fusing target and context features at each transformer block output. For more lightweight convolutional approaches, the NMB can be placed after each convolutional block in the encoder and decoder, with ICFC applied at the corresponding feature levels. We plan to evaluate the effectiveness of both components within transformer-based and hierarchical medical imaging models.
Although ICL-NoiseUNet achieves strong segmentation performance, it remains a relatively heavyweight model, with more parameters than the baseline architectures. Nevertheless, its inference time is comparable to that of other models, as summarized in Appendix~\ref{appendix:inference}.
Future work will therefore also focus on developing more parameter-efficient architectures that combine noise maps with contextual information.
Regarding failure cases, we identify two main limitations of our method. First, performance degrades under extreme speckle noise and severe acoustic shadowing, as in some samples of the BUSI dataset \cite{AlDhabyani.2020}, where anatomical boundaries are heavily obscured. In these cases, noise overwhelms the underlying anatomical signal to the extent that even context-guided feature refinement cannot reliably recover boundary information. Second, our in-context learning approach is task-specific and requires anatomically relevant context examples. Context images from a different anatomical task (e.g., cardiac contexts for breast lesion segmentation) provide no meaningful guidance and lead to a significant drop in performance, because structural priors are incompatible across anatomical domains.

 
