\documentclass{midl} % Include author names
\usepackage{mwe} % to get dummy images
\jmlrvolume{-- nnn}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\editors{Accepted for publication at MIDL 2026}

\title[Benchmarking DL Models for IPH Segmentation on NCCT]{Multi-site Benchmarking of Deep Learning Models for Intraparenchymal Hemorrhage Segmentation on NCCT}

\midlauthor{%
\Name{Kauê {T N Duarte}\midljointauthortext{Corresponding author}\nametag{$^{1}$}} \orcid{0000-0002-4074-3672}\\
\Name{Abhijot {S Sidhu}\nametag{$^{1,2,3}$}} \orcid{0000-0002-4839-9466}\\
\Name{Murilo {C Barros}\nametag{$^{4}$}} \orcid{0000-0003-2452-8128}\\
\Name{Taha Aslan\nametag{$^{6}$}} \orcid{0000-0002-2768-5105}\\
\Name{Donghao Zhang\nametag{$^{5}$}} \orcid{0009-0002-8185-7577}\\
\Name{Jianhai Zhang\nametag{$^{1}$}} \orcid{0000-0002-0330-6908}\\
\Name{Devansh Bhatt\nametag{$^{1}$}}\\
\Name{Brij Karmur\nametag{$^{1}$}} \orcid{0000-0002-4224-173X}\\
\Name{Mohamed AlShamrani\nametag{$^{6}$}} \orcid{0009-0003-3009-0537}\\
\Name{Wu Qiu\nametag{$^{5}$}} \orcid{0000-0001-7827-8270}\\
\Name{Aravind Ganesh\nametag{$^{1}$}} \orcid{0000-0001-5520-2070}\\
\Name{Bijoy {K Menon}\nametag{$^{1}$}} \orcid{0000-0002-3466-496X}\\[1ex]
%
\addr $^{1}$ Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.\\
%\addr $^{2}$ Departments of Radiology, Hotchkiss Brain Institute, Cummings School of Medicine, University of Calgary, Calgary, AB, Canada.\\
\addr $^{2}$ Graduate Program in Biomedical Engineering, University of Calgary, Calgary, Canada.\\
\addr $^{3}$ Seaman Family MR Research Centre, Foothills Medical Centre, Calgary, Canada.\\
\addr $^{4}$ School of Technology, University of Campinas, Limeira, Brazil.\\
\addr $^{5}$ College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China.\\
\addr $^{6}$ Calgary Stroke Program, Department of Clinical Neurosciences, Foothills Medical Centre, University of Calgary, Calgary, Canada.
}


\begin{document}

\maketitle

\begin{abstract}
Intraparenchymal hemorrhage (IPH) is a critical and often fatal subtype of hemorrhagic stroke that requires rapid and accurate diagnosis on non-contrast computed tomography (NCCT) scans for effective treatment. While deep learning (DL) models, particularly convolutional neural networks (CNNs), offer potential for automating IPH segmentation, their real-world clinical utility is often limited by the lack of explicit integration of data from diverse hospital sites with varying imaging protocols. This study conducted a multi-site benchmarking of \textcolor{black}{five} prominent CNN architectures (baseline U-Net, Attention U-Net, Feature Pyramid Network (FPN), \textcolor{black}{Swin U-Net}, and Trans U-Net) for IPH segmentation on a heterogeneous dataset from 17 clinical sites. Models were evaluated using F-measure (\textit{a.k.a.} Dice), Intersection over Union (IoU), and 95\% Hausdorff Distance ($d_{H95}$). The advanced CNN variants (Attention U-Net, FPN, Trans U-Net) significantly outperformed the baseline U-Net in F-measure and IoU (\textit{e.g.}, FPN F-measure: $0.868$ vs. U-Net: $0.819$, $p<0.001$), with no significant difference among them. For boundary error, FPN reduced $d_{H95}$ compared to the baseline, whereas the improvement shown by Trans U-Net was not significant. These models exhibited robust cross-site generalization across hemorrhage volumes, with minimal site-specific effects on performance. This study demonstrates that advanced CNN variants can be adopted for IPH segmentation to standardize and potentially accelerate IPH diagnosis.


\end{abstract}


\begin{keywords}
stroke, intraparenchymal hemorrhage, artificial intelligence, medicine, computed tomography.
\end{keywords}

%-------------------------------------+
   \section{Introduction}            %|
   \label{sec:introduction}          %|
%-------------------------------------+

Stroke is a major cause of death and long-term disability globally. Each year, more than 12 million cases and over 7 million deaths are reported \cite{Feigen}. Among these cases, hemorrhagic stroke is one of the deadliest types, as it results from the rupture of cerebral blood vessels and subsequent intracranial bleeding. Although it accounts for a smaller portion of stroke cases, this type is associated with a high fatality rate.

Non-contrast computed tomography (NCCT) is a commonly used medical imaging modality for detecting stroke. It not only provides rapid, accessible information on intracranial hemorrhage (ICH) but also plays a critical role in emergency diagnosis and treatment planning. Fast detection of hemorrhage can help preserve brain function and increase the patient's survival rate \cite{ahmed2025ich}. In a clinical setting, however, this time pressure can affect the diagnostic process, potentially leading to delays or oversights.

Among the subtypes of ICH, intraparenchymal hemorrhage (IPH) represents a critical and challenging pathology. IPH is characterized by bleeding into the brain tissue ($\sim$15\% of total stroke cases) and carries a high mortality rate among the ICH subtypes, with 30-day mortality of 40--50\% \cite{roy2015intraparenchymal}. This mortality rate is nearly double that of ischemic stroke \cite{ref1,ref2}. Baseline hematoma volume is one of the strongest independent predictors of mortality, with patients with volumes $>30$~mL experiencing a mortality rate $>50\%$ \cite{ref3}. This volume often drives therapeutic choices, such as selecting candidates for minimally invasive evacuation or deciding on surgery after follow-up imaging \cite{ref4}. Unlike other subtypes of ICH, like subdural or subarachnoid hemorrhage, where surgical evacuation is primarily anatomically guided, IPH management often relies heavily on precise volumetric quantification.
%
However, accurately measuring IPH is challenging due to irregular lesion boundaries, variable texture patterns, and proximity to complex anatomical structures, all of which can affect measurements. Accurate volumetry therefore typically requires specialized, robust analytical tools that can segment and measure these regions consistently across different sites.

Artificial Intelligence (AI) techniques have accelerated stroke diagnosis by automating manual tasks, such as detection and segmentation, while maintaining high accuracy rates. Among AI techniques, convolutional neural networks (CNNs) automatically extract information from images and are primarily used for classification tasks, such as ASPECTS scoring. For semantic segmentation, CNN-based U-Net variants are often adopted, as they use an encoder-decoder architecture that not only extracts features but also reconstructs them in image space \cite{lin2025attention}. These models have demonstrated high confidence in distinguishing pathological tissue from healthy tissue \cite{m3sl}. The U-Net and its variants continue to improve, either by integrating attention mechanisms to delineate lesion boundaries more accurately or by using feature pyramid networks to capture fine-grained and global contextual details.

One major challenge for clinical translation is domain shift across sites, which arises from variability in scanner vendors, acquisition protocols, and site practices. Such variability can degrade model performance, and the degradation may go unnoticed unless models are adequately tested in real-world clinical settings. The literature often reports models trained on curated datasets, yet lacks systematic, comparative validation of these architectures for IPH segmentation across multiple clinical sites. Additionally, numerous studies have trained models on public data, addressing multiple hemorrhage types simultaneously, rather than optimizing for the complexities of IPH individually \cite{ahmed2025ich, piao2023transhardnet}. The focus on architectural novelty can also overshadow deeper investigation of IPH's intrinsic features.

We present a benchmarking study focused on IPH segmentation across multiple sites. We benchmarked \textcolor{black}{five} CNN architectures (U-Net, Attention U-Net, FPN, Swin U-Net, and Trans U-Net) on a heterogeneous, multi-site NCCT dataset and report F-measure, IoU, and $d_{H95}$. We evaluate generalizability by statistically modelling these metrics as a function of hemorrhage volume, demographic covariates, and acquisition site.

The remainder of this paper is organized as follows. Section \ref{sec:related_work} reviews related work relevant to the study. Section \ref{sec:method} details the materials, methods, and statistical definitions employed. Section \ref{sec:results} presents the results, and Section \ref{sec:discussion} provides an analysis of these findings. Finally, Section \ref{sec:conclusion} outlines the conclusions and suggests directions for future research.

%-------------------------------------+
   \section{Related Work}            %|
   \label{sec:related_work}          %|
%-------------------------------------+

The accurate and timely segmentation of intracranial hemorrhage on NCCT is essential for acute stroke management, impacting diagnostic speed, treatment planning, and, ultimately, patient outcomes \cite{ahmed2025ich}. However, manual interpretation by radiologists can be time-consuming and is often subject to inter-observer variability \cite{inkeaw2022automatic}. Deep learning models can help address these limitations by automating segmentation, thereby reducing diagnostic delays and standardizing analysis \cite{piao2023transhardnet}.

In response to pressing clinical needs, researchers have concentrated on developing advanced segmentation architectures. Models such as the U-Net framework and its variants, U-Net++, Attention U-Net, and ResU-Net, have demonstrated strong performance on curated public benchmarks \cite{lin2025attention}. \textcolor{black}{In a systematic review, Zarei \textit{et al.} \cite{zarei2024deep} corroborated that U-Net-based architectures are particularly powerful for IPH segmentation on NCCT.} More recently, transformer-based models and hybrid architectures, such as TransHarDNet, have been explored to more effectively capture long-range dependencies \cite{piao2023transhardnet}. \textcolor{black}{Aside from segmentation tasks, studies such as Gong \textit{et al.} \cite{gong2023unified} proposed unified frameworks for simultaneous ICH volume quantification (regression) and patient prognosis, using a feature extractor based on 3D ResNet and adopting Grad-CAM for enhanced interpretability.} These models consistently achieve high Dice Similarity Coefficients, sometimes reaching 0.85 on their respective test sets \cite{ahmed2025ich, lin2025attention}, highlighting the considerable potential of deep learning for this task. For multiple downstream tasks, Zhang \textit{et al.} \cite{donghao} proposed a multi-task study focused on understanding the use of DL for several hemorrhage applications.
\textcolor{black}{When a large amount of data is available, foundation models can also be integrated. Gerbasil \textit{et al.} \cite{gerbasil2025adapting} introduced a semi-automated pipeline for IPH segmentation, combining a fine-tuned YOLOv8-S model for slice-specific lesion detection with a prompted Medical Segment Anything Model.}

\textcolor{black}{Still focused on multiple downstream tasks, Kaur \textit{et al.} \cite{kaur2024deep} introduced models to detect, segment, and classify multiple ICH subtypes, including IPH, using a customized CNN for segmentation and a Hybrid YOLO-HD model for classification. Targeting IPH specifically, Juan \textit{et al.} \cite{juan2026multi} developed a multitask pipeline for classification, detection, and weakly supervised 3D segmentation. Their framework integrates parallel tasks for ICH classification using SE-ResNeXt-50, perihematomal edema detection with YOLOv8 and RT-DETR, and a novel 3D PHE-pretrained nnU-Net for IPH segmentation using pseudo-labels derived from edema masks.}

However, this predictive performance is often obtained and validated using homogeneous or publicly available datasets collected with standardized imaging protocols \cite{roy2015intraparenchymal}. When implemented in real-world clinical settings, these models can encounter a notable domain shift, resulting in lower performance \cite{inkeaw2022automatic}. This shift is driven by variations in scanner vendors, acquisition parameters (\textit{e.g.}, tube voltage and slice thickness), and reconstruction kernels across hospitals. Although some studies have begun to address this issue through approaches such as multi-window input optimization \cite{inkeaw2022automatic}, a gap remains in validating segmentation models on large, heterogeneous, multi-site datasets. The comparative analysis of how CNN variants generalize across multiple clinical sites remains underexplored, despite its vital importance for real-world applications.

This validation paper studies deep learning models for IPH segmentation using a multi-site dataset characterized by significant protocol heterogeneity. The models were trained explicitly for segmenting intraparenchymal hemorrhage. Unlike existing studies, our work quantifies the performance and generalization of these specialized models across a diverse, multi-hospital private dataset, rather than developing a new architecture based on public data or addressing a wide array of hemorrhage types. This approach provides a real-world evaluation of the challenges of AI deployment in stroke care, particularly highlighting how different architectural strategies maintain performance across varied clinical settings.


%Recent advancements in intraparenchymal hemorrhage (IPH) segmentation have increasingly moved beyond standard convolutional networks toward sophisticated architectures designed to capture complex spatial dependencies. To address the limitations of local convolution operations, researchers have introduced transformer-based approaches, such as the HarDNet-based transformer, which effectively models global context to improve segmentation precision on non-contrast CT scans \cite{piao2023intracerebral}. Concurrently, hybrid models integrating attention mechanisms into residual U-Net frameworks have been proposed to enhance feature extraction, specifically targeting the challenges of multi-label hemorrhage segmentation \cite{lin2025advanced}. These innovations aim to refine boundary delineation and better handle the heterogeneous texture patterns characteristic of acute IPH.

%Parallel to architectural developments, the literature emphasizes the necessity of clinical translation and rigorous validation. New frameworks combining segmentation with classification are being explored to streamline emergency diagnostics, offering potential improvements in rapid stroke assessment \cite{ahmed2025intracranial}. Furthermore, recent work has investigated the adaptation of foundation models to emergency settings, seeking to leverage large-scale pre-training for robust clinical response \cite{gerbasi2025adapting}. Despite these reported successes, a systematic review and meta-analysis of deep learning algorithms for IPH segmentation underscores that while high accuracy is achievable, substantial variability in validation methods persists, necessitating standardized benchmarking to ensure real-world reliability \cite{zarei2024deep}.

%-------------------------------------+
   \section{Materials and Methods}   %|
         \label{sec:method}          %|
%-------------------------------------+

\subsection{Dataset and Participant Demographics}


The study utilized a multi-site dataset with IPH segmentation masks. In total, $N = 239$ subjects were included from 17 clinical sites across Canada (labelled A--Q in accordance with our ethics board) participating in the ACT Trial imaging collection \cite{bijoy_act}. All imaging data included manual ground-truth annotations for intraparenchymal hemorrhage (IPH), along with descriptive information such as age and sex. Table~\ref{tab:cohort} summarizes the demographic and clinical characteristics of the study population.

\begin{table}
\centering
\caption{Demographic and clinical characteristics of the study population across the 17 anonymized clinical sites (A--Q). Data are presented as mean $\pm$ standard deviation for Age (years) and IPH volume (Vol, cm$^3$). Sex is reported as the count of male patients with the corresponding percentage in parentheses. The sample size (N) for each site is also provided.}
\begin{tabular}{c|ccccc}
 & \textbf{A} & \textbf{B} & \textbf{C} & \textbf{D} & \textbf{E} \\
\hline
Age  & 71.37 $\pm$ 15.00 & 59.00 $\pm$ 18.38 & 72.97 $\pm$ 13.74 & 78.50 $\pm$ 12.07 & 78.78 $\pm$ 12.62 \\
Sex & 30 (55.6\%) & 2 (100.0\%) & 21 (52.5\%) & 4 (50.0\%) & 5 (55.6\%) \\
Vol & 12.69 $\pm$ 22.81 & 24.54 $\pm$ 32.42 & 25.29 $\pm$ 50.69 & 19.18 $\pm$ 28.48 & 3.32 $\pm$ 4.37 \\
N & 54 & 2 & 40 & 8 & 9 \\\hline
\multicolumn{6}{c}{}\\

 & \textbf{F} & \textbf{G} & \textbf{H} & \textbf{I} & \textbf{J} \\
\hline
Age  & 70.57 $\pm$ 12.14 & 75.75 $\pm$ 11.57 & 74.17 $\pm$ 9.39 & 72.60 $\pm$ 19.50 & 72.43 $\pm$ 12.99 \\
Sex & 3 (42.9\%) & 3 (37.5\%) & 2 (33.3\%) & 2 (40.0\%) & 18 (64.3\%) \\
Vol & 3.54 $\pm$ 3.09 & 17.05 $\pm$ 36.22 & 10.17 $\pm$ 14.15 & 7.66 $\pm$ 12.61 & 7.50 $\pm$ 17.22 \\
N & 7 & 8 & 6 & 5 & 28 \\\hline
\multicolumn{6}{c}{}\\


 & \textbf{K} & \textbf{L} & \textbf{M} & \textbf{N} & \textbf{O} \\
\hline
Age  & 76.64 $\pm$ 12.18 & 80.89 $\pm$ 11.89 & 84.50 $\pm$ 6.81 & 73.58 $\pm$ 11.68 & 77.33 $\pm$ 7.51 \\
Sex & 2 (18.2\%) & 7 (77.8\%) & 8 (50.0\%) & 12 (46.2\%) & 2 (66.7\%) \\
Vol & 4.87 $\pm$ 5.42 & 34.84 $\pm$ 40.85 & 18.71 $\pm$ 32.98 & 12.25 $\pm$ 28.40 & 6.88 $\pm$ 8.50 \\
N & 11 & 9 & 16 & 26 & 3 \\\hline
\multicolumn{6}{c}{}\\


     & \textbf{P}        & \textbf{Q}       & \textbf{Total}\\\hline

Age  & 75.20 $\pm$ 15.32 & 87.50 $\pm$ 2.12 & 74.40 $\pm$ 13.29\\
Sex  & 3 (60.0\%)        & 2 (100.0\%)      & 126 (52.7\%)\\
Vol  & 0.85 $\pm$ 1.42   & 0.44 $\pm$ 0.56  & 14.27 $\pm$ 30.36\\
N    & 5                 & 2                & 239  \\
\hline
\end{tabular}
\label{tab:cohort}
\end{table}

\noindent\textbf{ACT Trial Imaging Dataset}. We used CT imaging data and corresponding hemorrhage segmentation masks provided by the ACT Trial investigators \cite{bijoy_act}. Site identifiers were anonymized in accordance with ethical and data-sharing agreements. Hemorrhagic stroke diagnoses were confirmed by board-certified neurologists using standardized clinical criteria. Ground-truth segmentation masks were generated using a semi-automated workflow that combines algorithmic lesion proposals with expert manual correction and review.

\subsection{Data Preparation}

To improve the quality of the NCCT scans, we applied skull stripping using SynthStrip, adjusted to CT \cite{synth_strip}. All 3D volumes were standardized to dimensions that are multiples of 64 through zero-padding. Each volume was split into $64 \times 64 \times 64$ patches to facilitate memory-efficient processing. To address class imbalance between hemorrhagic and non-hemorrhagic voxels, we employed a selective patching strategy that retained patches containing at least one hemorrhagic lesion voxel during training, validation, and testing.
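As an illustration of the padding and patching arithmetic (a sketch assuming non-overlapping patches; the symbols $D$ and $D'$ and the example volume size are hypothetical and not taken from the dataset), a dimension of length $D$ is padded to
\[
    D' = 64 \left\lceil \frac{D}{64} \right\rceil ,
\]
so that a volume of $512 \times 512 \times 40$ voxels would be padded to $512 \times 512 \times 64$ and yield $8 \times 8 \times 1 = 64$ candidate patches before the selective patching step.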
%
Intensity normalization was carried out in two steps: (1) we clipped the intensities to the range $[-30, 100]$ Hounsfield units (HU); (2) we performed min-max normalization, mapping the image intensities to the range $[0.0, 1.0]$.
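
\noindent As a compact restatement of these two steps (a sketch assuming that the clipping bounds also serve as the minimum and maximum of the min-max normalization; the symbols $x$, $\tilde{x}$, and $\hat{x}$ are introduced here for illustration only), a voxel intensity $x$ in HU is mapped as
\[
    \tilde{x} = \min\{\max\{x, -30\}, 100\}, \qquad \hat{x} = \frac{\tilde{x} + 30}{130}.
\]
Under this assumption, a voxel at $60$ HU maps to $90/130 \approx 0.69$, while all values at or below $-30$ HU map to $0$ and all values at or above $100$ HU map to $1$.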

\subsection{Deep Learning Architectures}

\textcolor{black}{Our benchmarking study focuses on U-Net-style architectures. We restricted the comparison to this family for two main purposes: (1) to validate and compare commonly deployed architectures in medical image segmentation, and (2) to draw broader conclusions about how specific, controlled architectural enhancements impact performance on IPH segmentation. Specifically, we compared attention mechanisms, multi-scale feature fusion, and transformer-based context modelling against the baseline U-Net. U-Nets use an encoder-decoder architecture with skip connections, enabling efficient feature localization. We intentionally excluded large foundation models (\textit{e.g.}, prompt-driven SAT) and end-to-end self-supervised pretraining to keep the comparison controlled and reproducible, since pretraining introduces orthogonal variables (pretraining data/objective, prompt/adapter design) and typically requires substantially more unlabeled data and compute.}






%
We implemented and compared \textcolor{black}{five} state-of-the-art CNN variants for IPH segmentation:

\begin{enumerate}
    \item \textit{Baseline U-Net}: We implemented the original U-Net architecture \cite{ronneberger2015unet} as our baseline model. This encoder-decoder network with skip connections provides a robust foundation for medical image segmentation.
    
    \item \textit{Attention U-Net}: This architecture enhances the traditional U-Net by incorporating attention gates in the skip connections \cite{attention}. The attention mechanisms selectively emphasize relevant spatial features while suppressing irrelevant regions, particularly beneficial for detecting small hemorrhagic lesions and precise boundary delineation.
    
    \item \textit{Feature Pyramid Network (FPN)}: The FPN architecture \cite{FPN} builds a multi-scale feature pyramid through top-down pathways and lateral connections. This design enables effective feature extraction across multiple scales, which is advantageous for detecting hemorrhagic lesions of varying sizes and shapes.

    \item \textit{Trans U-Net}: This architecture uses a hybrid CNN-Transformer design \cite{transunet}, combining U-Net local feature extraction with the Vision Transformer (ViT). The integration of ViT and U-Net provides improved global context for IPH masks and is believed to enhance boundary delineation.

    \item \textcolor{black}{\textit{Swin U-Net}: This architecture \cite{cao2021swin} introduces a hierarchical Swin Transformer as the encoder to capture long-range dependencies across image patches. A symmetric Swin Transformer-based decoder, connected via skip connections, then upsamples the features to generate segmentation maps. }
\end{enumerate}

All architectures utilized a VGG16 backbone \cite{vgg16} for feature extraction in the encoder pathway, consistent with previous work demonstrating its effectiveness for medical image segmentation tasks \cite{m3sl,sibkaue}.% Figure XXX illustrates the detailed architecture of each model.

\subsection{Model Training and Evaluation}

Model training was conducted for a maximum of 300 epochs with an initial learning rate of $5 \times 10^{-4}$. 
\textcolor{black}{We employed the Adam optimizer and reduced the learning rate on plateau using Keras' \texttt{ReduceLROnPlateau}. Specifically, we monitored validation IoU and used the following settings: \texttt{factor=0.5}, \texttt{patience=8}, \texttt{mode='max'}, \texttt{min\_lr=$1\times10^{-7}$}, \texttt{cooldown=0}, and \texttt{verbose=1}. 
Thus, when the monitored validation IoU did not improve for 8 consecutive epochs, the learning rate was multiplied by 0.5, down to a minimum of $1\times10^{-7}$. The model with the highest validation IoU was saved as the final model.}
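As a rough sketch of this schedule (the symbol $\eta_k$ is introduced here for illustration), the learning rate after $k$ plateau-triggered reductions can be written as
\[
    \eta_k = \max\left\{ 5 \times 10^{-4} \cdot 0.5^{k},\; 1 \times 10^{-7} \right\},
\]
so the floor is reached at the 13th reduction, since $5 \times 10^{-4} / 2^{12} \approx 1.22 \times 10^{-7}$ is still above the minimum, whereas $5 \times 10^{-4} / 2^{13} \approx 6.1 \times 10^{-8}$ falls below it.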
For each architecture, we trained separate 2D models using axial (2DAxi), coronal (2DCor), and sagittal (2DSag) projections, and obtained final predictions by averaging across these projections (2.5D model).
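One way to write this fusion (a sketch; the per-voxel notation and the threshold $\tau$ are assumptions, as the binarization step is not specified here) is
\[
    \hat{p}(v) = \frac{1}{3}\left[ p_{\mathrm{Axi}}(v) + p_{\mathrm{Cor}}(v) + p_{\mathrm{Sag}}(v) \right], \qquad \hat{y}(v) = \mathbf{1}\left[ \hat{p}(v) \geq \tau \right],
\]
where $p_{\mathrm{Axi}}(v)$, $p_{\mathrm{Cor}}(v)$, and $p_{\mathrm{Sag}}(v)$ denote the probabilities predicted for voxel $v$ by the three projection-specific models once they are brought onto a common voxel grid.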




\noindent\textbf{Loss Function}. To address the class imbalance between \emph{True} and \emph{False} elements in the IPH masks, we used a composite loss function combining Dice Loss ($\mathit{DL}$, Eq.~\ref{equation:dice}) and Binary Focal Loss ($\mathit{FL}$, Eq.~\ref{equation:focal}).

\begin{equation}
    \label{equation:dice}
     \mathit{DL} = \frac{(1 + \beta^2) \cdot \mathit{TP}} {(1 + \beta^2) \cdot \mathit{TP} + \beta^2 \cdot \mathit{FN} + \mathit{FP}}
\end{equation}

\noindent where $\beta$ corresponds to a balance coefficient, $\mathit{TP}$, $\mathit{FP}$, and $\mathit{FN}$ represent the true positive, false positive, and false negative voxels, respectively. \textcolor{black}{For all reported experiments, we set $\beta = 1.0$ (equal weighting of precision and recall).}
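
\noindent As a quick check of this setting, substituting $\beta = 1$ into Eq.~\ref{equation:dice} gives
\[
    \mathit{DL}\big|_{\beta = 1} = \frac{2\,\mathit{TP}}{2\,\mathit{TP} + \mathit{FN} + \mathit{FP}},
\]
which is the standard Dice overlap. This expression increases with overlap, so a quantity such as $1 - \mathit{DL}$ would be the term to minimize; that convention is common in segmentation libraries but is an assumption here.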

\begin{equation}
\label{equation:focal}
 \begin{aligned}
 \mathit{FL} = {} & - \mathit{GT}\, \alpha\, (1 - \mathit{PT})^\gamma \log(\mathit{PT}) \\
                  & - (1 - \mathit{GT})\, \alpha\, \mathit{PT}^\gamma \log(1 - \mathit{PT})
 \end{aligned}
\end{equation}



\noindent where $\mathit{GT}$ is the ground-truth label, $\mathit{PT}$ is the predicted probability, and $\alpha=0.25$ and $\gamma=2.0$ are values that were fine-tuned through a calibration process.
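
\noindent To illustrate the effect of the modulating factor (a worked example with hypothetical predictions, assuming the natural logarithm), consider a hemorrhagic voxel with $\mathit{GT} = 1$. A confident correct prediction, $\mathit{PT} = 0.9$, contributes
\[
    \mathit{FL} = -0.25 \cdot (1 - 0.9)^{2} \cdot \ln(0.9) \approx 2.6 \times 10^{-4},
\]
whereas a poor prediction, $\mathit{PT} = 0.1$, contributes
\[
    \mathit{FL} = -0.25 \cdot (1 - 0.1)^{2} \cdot \ln(0.1) \approx 0.47.
\]
The hard voxel is therefore weighted roughly $1{,}800$ times more heavily than the easy one, compared with a factor of about $22$ for unmodulated cross-entropy ($\gamma = 0$).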

\noindent\textbf{Performance Metrics}. The class imbalance between IPH and non-IPH voxels rendered accuracy an unsuitable performance metric, as the large number of true negatives ($\mathit{TN}$) would disproportionately influence the results. To better evaluate model performance, we utilized the $F$-measure, intersection-over-union (IoU), and Hausdorff distance. The $F$-measure, \textit{a.k.a.} the Dice coefficient for binary segmentation, is a commonly adopted metric in image segmentation:

\begin{equation}
       \text{F} = 2 \times \frac{P \times R}{P+R} 
\end{equation}

\noindent which represents the harmonic mean of precision ($P$) and recall ($R$):

\begin{equation}
       P = \frac{\mathit{TP}}{\mathit{TP}+\mathit{FP}} \;\;\text{and}\;\;
       R = \frac{\mathit{TP}}{\mathit{TP}+\mathit{FN}}. \nonumber
\end{equation}
       
\noindent IoU quantifies the overlap between predictions and ground truth:

\begin{equation}
        \text{IoU} = \frac{\mathit{TP}}{\mathit{FP}+\mathit{TP}+\mathit{FN}}.
\end{equation}
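
\noindent For a single confusion matrix, the two overlap metrics are linked by a monotonic identity, which follows from writing the $F$-measure as $2\,\mathit{TP}/(2\,\mathit{TP} + \mathit{FP} + \mathit{FN})$:
\[
    \text{IoU} = \frac{\text{F}}{2 - \text{F}}, \qquad \text{F} = \frac{2\,\text{IoU}}{1 + \text{IoU}}.
\]
As a purely illustrative calculation, a case with $\text{F} = 0.80$ has $\text{IoU} = 0.80/1.20 \approx 0.67$. The identity holds per case and not, in general, for values averaged over cases.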

\noindent Throughout training, the model achieving the highest IoU value was retained as optimal.

The Hausdorff distance measures the separation between the predicted and ground-truth IPH boundaries. For two point sets $A$ and $B$ representing these boundaries, the Hausdorff distance $d_H$ is defined as:

\begin{multline}
    d_H(A,B) = \max\{d_{AB},d_{BA}\} \\
    = \max\left\{\max_{a\in A}\left\{\min_{b\in B}\{d(a,b)\}\right\}, \max_{b\in B}\left\{\min_{a\in A}\{d(a,b)\}\right\}\right\}
\end{multline}

\noindent where $a$ and $b$ represent elements of sets $A$ and $B$, respectively, and $d(a,b)$ is the Euclidean distance between them. We used the 95th percentile of the Hausdorff distance distribution ($d_{H95}$) to assess performance. Superior performance is indicated by higher $F$-measure and IoU values, and a lower $d_{H95}$ value.
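
\noindent One common way to formalize the percentile variant (assumed here for illustration, since the exact implementation is not detailed in the text; $P_{95}$ denotes the 95th percentile over the indicated set) is
\[
    d_{H95}(A,B) = \max\left\{ P_{95}\left( \left\{ \min_{b\in B} d(a,b) \right\}_{a\in A} \right),\; P_{95}\left( \left\{ \min_{a\in A} d(a,b) \right\}_{b\in B} \right) \right\},
\]
which replaces the outer maxima of the Hausdorff distance with 95th percentiles and thereby reduces sensitivity to a few outlying boundary points.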

\textbf{Implementation Details}. We ran experiments on a four-node cluster (8$\times$ Tesla V100 16GB GPUs; 754 GB total system RAM). To enable a fair comparison with other U-Net variants, each brain projection was trained independently in parallel, substantially reducing the total training time. Model development was carried out using Python 3.6 in Jupyter Notebook, and the resulting code was subsequently converted to Python scripts to enable cluster execution. The full source code and Keras-based implementations are publicly available.\footnote{https://github.com/KaueTND/ip-hemorrhagic-stroke-segmentation}


\subsection{Statistical Analysis}

Five-fold cross-validation was used to evaluate all \textcolor{black}{five} U-Net models, and results are reported as the mean $\pm$ standard deviation. Throughout the analysis, appropriate tests were performed to assess the validity of the model assumptions, including tests of residual normality. Significance level was set to $\alpha = 0.05$ for all statistical tests. The R statistical package (https://www.r-project.org/) was used.

%To evaluate the effect of segmentation style on performance metrics $PM$ (F-measure, IoU, and $d_{H95}$), separate linear regression models were fitted for each segmentation style to assess the relationship between performance metrics and covariates, including ground truth hemorrhage volume (\textit{IPH\_vol}), $Age$, $Sex$, and site variability ($Sitename$). The model structure for each style was:

%\begin{equation}
%PM \sim IPH\_{vol} + Age + Sex + Sitename
%\end{equation}

%Subsequently, a one-way analysis of variance (ANOVA) was conducted to test for overall differences in performance across the four segmentation styles (U-Net, Attention U-Net, FPN, Swin U-Net, and Trans U-Net). Post-hoc pairwise comparisons were performed using t-tests with Bonferroni correction to control for multiple comparisons. 

To evaluate the effect of segmentation architecture on the performance metrics $PM$ (F-measure, IoU, and $d_{H95}$), \textcolor{black}{we employed a two-stage linear modelling framework designed to account for the substantial imbalance in sample sizes across acquisition sites. All analyses were performed separately for each segmentation architecture.
First, to assess performance as a function of stroke severity in sites with limited sample sizes, we fitted a linear regression model on the pooled data from all sites with fewer than 15 cases. In this model, each performance metric was modelled as a function of ground-truth intraparenchymal hemorrhage volume ($IPH_{vol}$), which served as a surrogate measure of lesion severity. Site was not included as a covariate in this model due to insufficient within-site variability and the instability of site-specific estimates in small-sample settings. The model structure was:}


\begin{equation}
PM \sim IPH_{vol}
\end{equation}


\textcolor{black}{Second, to evaluate algorithm performance and generalizability across acquisition sites with adequate sample sizes, a separate linear regression model was fitted, including only sites with more than 15 cases. In this model, site was included as a categorical fixed effect to account for acquisition-related variability, alongside hemorrhage volume and demographic covariates:}

\begin{equation}
PM \sim IPH_{vol}+ Age + Sex +Sitename
\end{equation}


This model was used to assess whether segmentation performance varied systematically across sites under conditions in which site-specific effects could be reliably estimated.
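
\noindent Written out explicitly (a sketch assuming R's default treatment coding for the categorical site factor; the coefficient symbols $\theta$ and $\delta_s$ are introduced here for illustration), the second-stage model for subject $i$ is
\[
    PM_i = \theta_0 + \theta_1\, IPH_{vol,i} + \theta_2\, Age_i + \theta_3\, Sex_i + \sum_{s \in \mathcal{S} \setminus \{s_0\}} \delta_s\, \mathbf{1}\left[ Sitename_i = s \right] + \varepsilon_i,
\]
where $\mathcal{S}$ is the set of sites with more than 15 cases, $s_0$ is the reference site, and $\varepsilon_i$ is the residual. According to Table~\ref{tab:cohort}, $\mathcal{S}$ comprises sites A, C, J, M, and N (164 subjects), so four site coefficients $\delta_s$ are estimated, each expressing the mean difference in $PM$ relative to $s_0$ after adjusting for the other covariates. The first-stage model has the same form with only the intercept and the $IPH_{vol}$ term, fitted on the remaining sites.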

To test for overall differences in mean performance across segmentation architectures (U-Net, Attention U-Net, FPN, Swin U-Net, and Trans U-Net), a one-way analysis of variance (ANOVA) was performed for each performance metric. When the omnibus ANOVA indicated a significant effect of segmentation architecture, post-hoc pairwise comparisons were conducted using t-tests with Bonferroni correction to control for multiple comparisons.
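
As a brief sketch of the correction, assuming that all pairs among the five architectures are compared (the comparison count $m$ below follows from that assumption, and $\alpha$ denotes the nominal significance level), the number of comparisons per metric is $m = 5 \cdot 4 / 2 = 10$, and each raw pairwise $p$-value is adjusted as
\begin{equation}
p_{adj} = \min(1,\; m \cdot p),
\end{equation}
so that a comparison is declared significant only if $p_{adj} < \alpha$, or equivalently $p < \alpha / m$. The cap at 1 means that adjusted $p$-values of exactly $1.000$ can be reported for pairs with large raw $p$-values.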

\subsection{\textcolor{black}{Clinically-specific Analysis}}

\textcolor{black}{
Performance analyses were stratified by hemorrhage volume and anatomical adjacency (sulci, ventricles, or other regions) so that model comparisons reflect clinical and methodological variability. IPH volume was categorized as small ($<$5~mL), moderate (5--30~mL), or large ($>$30~mL) because lesion size can affect both voxel-wise learning and error interpretation. In small hemorrhages, even a few misclassified voxels can produce a disproportionately large relative volumetric error and meaningfully shift clinical severity estimates. In large hemorrhages, the same absolute error has less effect on the volume estimate but may still degrade boundary precision. We report F-measure and IoU for volumetric overlap and $d_{H95}$ for boundary delineation.}




%--------------------------------------+
   \section{Results}                  %|
         \label{sec:results}          %|
%--------------------------------------+

%\subsection{Multi-site Generalization Performance}

\subsection{\textcolor{black}{Segmentation Performance Across Architectures}}

\textcolor{black}{Segmentation performance was evaluated using F-measure, Intersection-over-Union (IoU), and the 95th percentile Hausdorff distance ($d_{H95}$). Analyses were conducted separately for sites with limited sample sizes and sites with adequate sample sizes to account for substantial imbalance in site-level representation.}

\paragraph{Severity-dependent performance in small-sample sites.}
\textcolor{black}{For sites with fewer than 15 cases, performance was evaluated using linear regression models relating each performance metric to ground-truth intraparenchymal hemorrhage volume ($IPH_{vol}$), which served as a surrogate measure of lesion severity. Demographic variables and site were not included in these models due to insufficient within-site variability and the instability of site-specific estimates in small-sample settings.}

\textcolor{black}{Across all segmentation architectures, increasing $IPH_{vol}$ was associated with improved overlap-based performance. Specifically, significant positive associations between $IPH_{vol}$ and F-measure were observed for all models (U-Net: $\beta = 1.98 \times 10^{-6}$, $p < 0.001$; Attention U-Net: $\beta = 1.37 \times 10^{-6}$, $p < 0.001$; FPN: $\beta = 1.42 \times 10^{-6}$, $p = 0.001$; Trans U-Net: $\beta = 1.41 \times 10^{-6}$, $p = 0.001$; Swin U-Net: $\beta = 2.30 \times 10^{-6}$, $p = 0.009$). Similar severity-dependent improvements in IoU were observed across all architectures (all $p < 0.01$).}

\textcolor{black}{For boundary accuracy, no significant associations between $IPH_{vol}$ and $d_{H95}$ were observed in the small-site cohort for any architecture (all $p > 0.16$), which may reflect the limited sensitivity of boundary-based metrics in sparse, heterogeneous samples.}

\paragraph{Generalizability across large-sample sites.}
\textcolor{black}{For sites with at least 15 cases, segmentation performance was assessed using linear regression models that included $IPH_{vol}$, age, sex, and site as fixed effects. In these models, $IPH_{vol}$ remained a significant predictor of performance across all architectures and metrics. Larger hemorrhage volumes were consistently associated with higher F-measure and IoU values (all $p < 10^{-4}$) and lower $d_{H95}$ values (all $p < 0.05$), indicating improved overlap and boundary accuracy for larger lesions.}

\textcolor{black}{No significant effects of age, sex, or acquisition site were observed for any performance metric after correction for multiple comparisons (all corrected $p > 0.05$). Although isolated site-specific coefficients reached nominal significance in uncorrected analyses, these effects did not survive correction for multiple testing and were not consistent across segmentation architectures or metrics. Collectively, these findings indicate that segmentation performance was not systematically influenced by demographic factors or acquisition site, supporting the robustness and generalizability of all models across heterogeneous clinical imaging conditions.}

\paragraph{Comparison of segmentation architectures.}
\textcolor{black}{To assess overall differences in mean performance across segmentation architectures, one-way analyses of variance (ANOVA) were conducted on the full dataset for each performance metric. Significant main effects of architecture were observed for all metrics, including F-measure ($F(4,1190) = 7.17$, $p = 1.06 \times 10^{-5}$), IoU ($F(4,1190) = 7.84$, $p = 3.1 \times 10^{-6}$), and $d_{H95}$ ($F(4,1189) = 6.17$, $p = 6.51 \times 10^{-5}$).}

\textcolor{black}{Post-hoc pairwise comparisons with Bonferroni correction revealed that Attention U-Net, FPN, and Trans U-Net significantly outperformed the baseline U-Net for both F-measure and IoU (all adjusted $p < 0.01$). Swin U-Net demonstrated significantly lower overlap performance than Attention U-Net, FPN, and Trans U-Net (all adjusted $p < 0.01$). No significant differences were observed among the three advanced convolutional architectures (Attention U-Net, FPN, and Trans U-Net) after correction for multiple comparisons.}

\textcolor{black}{For $d_{H95}$, FPN achieved significantly lower boundary error relative to the baseline U-Net ($p = 0.0287$), whereas no significant differences were observed between U-Net and Attention U-Net ($p = 0.118$) or Trans U-Net ($p = 1.000$). In contrast, Swin U-Net exhibited significantly higher boundary error compared to Attention U-Net ($p = 0.0017$) and FPN ($p = 0.00024$), while differences relative to Trans U-Net ($p = 0.228$) and U-Net ($p = 1.000$) were not significant. Mean performance values for each metric and architecture are reported in Tables~\ref{tab:fmeasure}, \ref{tab:iou}, and \ref{tab:hausdorff}.}


\paragraph{Clinically relevant stratification.}
\textcolor{black}{Figure~\ref{fig:site} illustrates F-measure performance stratified by acquisition site and segmentation architecture. In addition, Figure~\ref{fig:anat_hemo} shows segmentation performance across clinically relevant strata, namely anatomical hemorrhage location (near sulci, near ventricles, or other regions) and hemorrhage volume category ($<$5~mL, 5--30~mL, and $>$30~mL), further demonstrating consistent performance trends across lesion characteristics.}








\begin{table}[b]
    \centering
    \caption{F-measure for IPH segmentation. Performance is compared across five CNN variants (Attention U-Net, baseline U-Net, FPN, Swin U-Net, and Trans U-Net) on axial (2DAxi), coronal (2DCor), and sagittal (2DSag) projections, and their ensemble (2.5D). Results are reported as mean $\pm$ standard deviation. The best-performing model for each orientation is highlighted in \textbf{bold}.}
\resizebox{\linewidth}{!}{%    
    \begin{tabular}{c|ccccc}
Style & Attention U-Net & FPN & Swin U-Net & Trans U-Net & U-Net \\\hline
        Orientation &  &  &  \\\hline
2DAxi & 0.609 $\pm$ 0.244 & 0.635 $\pm$ 0.225 & \textbf{0.649 $\pm$ 0.170} & 0.629 $\pm$ 0.228 & 0.617 $\pm$ 0.221 \\
2DCor & 0.865 $\pm$ 0.129 & \textbf{0.868 $\pm$ 0.127} & 0.845 $\pm$ 0.135 & 0.863 $\pm$ 0.132 & 0.819 $\pm$ 0.153 \\
2DSag & 0.849 $\pm$ 0.155 & \textbf{0.854 $\pm$ 0.140} & 0.822 $\pm$ 0.150 & 0.847 $\pm$ 0.145 & 0.811 $\pm$ 0.164 \\
2.5D & 0.851 $\pm$ 0.165 & \textbf{0.862 $\pm$ 0.146} & 0.842 $\pm$ 0.139 & 0.855 $\pm$ 0.151 & 0.823 $\pm$ 0.164 \\\hline
    \end{tabular}}

    \label{tab:fmeasure}
\end{table}

\begin{table}[b]
    \centering
    \caption{IoU for IPH segmentation. Performance is compared across five CNN variants (Attention U-Net, baseline U-Net, FPN, Swin U-Net, and Trans U-Net) on axial (2DAxi), coronal (2DCor), and sagittal (2DSag) projections, and their ensemble (2.5D). Results are reported as mean $\pm$ standard deviation. The best-performing model for each orientation is highlighted in \textbf{bold}.}
\resizebox{\linewidth}{!}{%    
    \begin{tabular}{c|ccccc}
Style & Attention U-Net & FPN             & Swin U-Net & Trans U-Net     & U-Net \\\hline
        Orientation &  &  &  \\\hline
2DAxi & 0.479 $\pm$ 0.236 & 0.502 $\pm$ 0.225 & \textbf{0.504 $\pm$ 0.189} & 0.496 $\pm$ 0.227 & 0.481 $\pm$ 0.221 \\
2DCor & 0.780 $\pm$ 0.164 & \textbf{0.785 $\pm$ 0.161} & 0.751 $\pm$ 0.176 & 0.777 $\pm$ 0.167 & 0.718 $\pm$ 0.187 \\
2DSag & 0.762 $\pm$ 0.182 & \textbf{0.767 $\pm$ 0.173} & 0.721 $\pm$ 0.189 & 0.757 $\pm$ 0.179 & 0.709 $\pm$ 0.194 \\
2.5D & 0.767 $\pm$ 0.192 & \textbf{0.779 $\pm$ 0.176} & 0.749 $\pm$ 0.177 & 0.770 $\pm$ 0.182 & 0.726 $\pm$ 0.194 \\\hline
    \end{tabular}}
    \label{tab:iou}
\end{table}

\begin{table}[b]
    \centering
    \caption{$d_{H95}$ for IPH segmentation. Performance is compared across five CNN variants (Attention U-Net, baseline U-Net, FPN, Swin U-Net, and Trans U-Net) on axial (2DAxi), coronal (2DCor), and sagittal (2DSag) projections, and their ensemble (2.5D). Results are reported as mean $\pm$ standard deviation, in mm (lower is better). The best-performing model for each orientation is highlighted in \textbf{bold}.}
\resizebox{\linewidth}{!}{%    
    \begin{tabular}{c|ccccc}
Style       & Attention U-Net   & FPN          & Swin U-Net    & Trans U-Net      & U-Net \\\hline
Orientation &  &   &  &  \\\hline
2DAxi & \textbf{10.066 $\pm$ 9.643} & 11.624 $\pm$ 8.507 & 11.907 $\pm$ 7.785 & 10.371 $\pm$ 8.964 & 11.465 $\pm$ 9.336 \\
2DCor & 3.376 $\pm$ 9.066 & \textbf{2.916 $\pm$ 7.464} & 6.265 $\pm$ 10.987 & 4.855 $\pm$ 12.558 & 5.862 $\pm$ 11.051 \\
2DSag & 4.578 $\pm$ 11.038 & \textbf{4.203 $\pm$ 10.113} & 9.488 $\pm$ 15.439 & 5.594 $\pm$ 10.934 & 6.871 $\pm$ 10.526 \\
2.5D & 1.600 $\pm$ 4.974 & \textbf{1.574 $\pm$ 4.550} & 2.676 $\pm$ 8.557 & 1.615 $\pm$ 4.837 & 2.542 $\pm$ 6.356 \\\hline
    \end{tabular}}
    \label{tab:hausdorff}
\end{table}

\begin{figure}
    \centering
    \includegraphics[width=0.75\linewidth]{f-measure_by_style.pdf}
    \caption{%Qualitative comparison of IPH segmentation results across different CNN variants and imaging orientations. Two patients (A--B) are shown, representing large and small IPH volumes, respectively. Rows 1--4 show Patient A (85-year-old male, IPH volume: 144 cm$^3$), rows 5--8 show Patient B (79-year-old female, IPH volume: 7.9 cm$^3$), %and rows 7--9 show Patient C (80-year-old male, IPH volume: 0.8 cm$^3$). 
    %Each row group illustrates segmentations from axial, coronal, sagittal, and 2.5D views using Attention U-Net, baseline U-Net, FPN, Swin U-Net, and Trans U-Net models. F-Measure value is reported in the top left corner.
    \textcolor{black}{Qualitative comparison of IPH segmentation results across different CNN variants. Three IPH volume categories are shown: (a) $<$5~mL, (b) 5--30~mL, and (c) $>$30~mL. Within each volume category, the first row corresponds to IPH located near the ventricles, the second row to IPH near the sulci, and the third row to IPH in other anatomical regions. The F-measure is reported in the top-left corner of each panel. Red, blue, and green correspond to true positives (TP), false positives (FP), and false negatives (FN), respectively. Masks are overlaid on the original (non-skull-stripped) volume for better visualization.}}
    \label{fig:segmentation}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=1\linewidth]{Hemorrhage_size.pdf}
    \caption{Scatter plot of subject age (x-axis) against F-measure (y-axis) for IPH segmentation. Each point represents an individual subject, with marker size proportional to the IPH volume (mL) computed from the model's predicted segmentation mask. Colour ranges from red (volume underestimated relative to ground truth) through green (volume in agreement with ground truth) to blue (volume overestimated relative to ground truth).}
    \label{fig:hemorrhagesize}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=1\linewidth]{Fmeasure_Site_Orientation.pdf}
    \caption{Boxplot comparison of F-measure scores for IPH segmentation across the 17 clinical sites (A--Q), stratified by CNN variant.} %$N$ indicates the sample size per site.}
    \label{fig:site}
\end{figure}

\begin{figure}
    \centering
    \begin{tabular}{cc}
    \includegraphics[width=0.48\linewidth]{F-Measure_R_hemorrhage.pdf}     &  
    \includegraphics[width=0.48\linewidth]{F-Measure_R_anatomical.pdf}\\
    (a)&(b)
    \end{tabular}
    
    \caption{\textcolor{black}{Boxplot comparison of F-measure scores for IPH segmentation across CNN variants. (a) Performance stratified by hemorrhage volume range: small ($<$5~mL), moderate (5--30~mL), and large ($>$30~mL). (b) Performance stratified by IPH anatomical location, grouped as lesions adjacent to sulci, ventricles, or other brain regions. Each box represents the distribution of subject-level F-measure values for a given CNN variant within each subgroup.}}
    \label{fig:anat_hemo}
\end{figure}
%-------------------------------------+
   \section{Discussion}              %|
   \label{sec:discussion}            %|
%-------------------------------------+

% new discussion

In this study, we compared and evaluated different CNN variants for IPH segmentation, with a particular focus on generalizability across multi-site clinical data. Performance was assessed using three metrics commonly reported in the literature (F-measure, IoU, and $d_{H95}$). Using a two-stage linear modelling framework followed by ANOVA-based architecture comparisons, we found that Attention U-Net, FPN, and Trans U-Net significantly outperformed the baseline U-Net in terms of volumetric overlap, as measured by F-measure and IoU \textcolor{black}{(ANOVA: F-measure $F(4,1190)=7.17$, $p=1.06\times10^{-5}$; IoU $F(4,1190)=7.84$, $p=3.1\times10^{-6}$). In contrast, Swin U-Net demonstrated significantly lower overlap performance relative to the other advanced architectures (Bonferroni-corrected $p<0.01$ for all pairwise comparisons with Attention U-Net, FPN, and Trans U-Net).}

\textcolor{black}{For boundary detection, architectural differences were more selective. FPN achieved a statistically significant reduction in $d_{H95}$ relative to the baseline U-Net (Bonferroni-corrected $p=0.0287$), whereas Attention U-Net ($p=0.118$) and Trans U-Net ($p=1.000$) did not show significant improvements. Swin U-Net exhibited significantly higher boundary error compared to Attention U-Net ($p=0.0017$) and FPN ($p<0.001$), while differences relative to Trans U-Net ($p=0.228$) and U-Net ($p=1.000$) were not statistically significant. These patterns are summarized in Tables~\ref{tab:fmeasure}--\ref{tab:hausdorff}. Among orientations, 2DCor achieved the highest mean F-measure and IoU for every architecture except the baseline U-Net, whereas the 2.5D ensemble yielded the lowest mean $d_{H95}$ for all architectures, indicating improved boundary alignment with the ground truth.}

\textcolor{black}{Across sites with adequate sample sizes, the advanced convolutional architectures (Attention U-Net, FPN, and Trans U-Net) demonstrated comparable performance in F-measure and IoU, with no statistically significant differences observed after correction for multiple comparisons (all Bonferroni-corrected $p \geq 0.684$). Swin U-Net, however, consistently underperformed these models on both overlap-based metrics (all Bonferroni-corrected $p<0.01$). These findings indicate that while multiple architectures achieve robust volumetric overlap across heterogeneous acquisition conditions, transformer-based modelling alone does not guarantee improved performance for IPH segmentation in this setting.}

\textcolor{black}{In the large-site cohort, no statistically significant effects of age, sex, or acquisition site were observed for any performance metric after correction for multiple comparisons (all corrected $p>0.05$). Although isolated site-level coefficients reached nominal significance in uncorrected analyses, these effects were inconsistent across architectures and metrics and did not survive correction. This absence of systematic demographic or site effects supports the robustness and generalizability of the evaluated models across heterogeneous clinical imaging conditions.}

\textcolor{black}{Volume- and anatomy-stratified analyses (Figure~\ref{fig:anat_hemo}) further contextualized these findings. Small hemorrhages ($<$5~mL) posed the greatest challenge for both overlap-based and boundary-based metrics, whereas moderate (5–30~mL) and large ($>$30~mL) hemorrhages exhibited reduced variability and improved performance. These patterns align with the observed positive associations between IPH volume and F-measure/IoU and the negative trends observed for $d_{H95}$. Importantly, the boundary-precision advantage of FPN persisted across volume and anatomical regions, whereas Swin U-Net consistently demonstrated higher boundary error across these clinically relevant subgroups. Segmentation performance improved with increasing hemorrhage volume across all models. Ground-truth IPH volume showed significant positive associations with F-measure and IoU across architectures (all $p<0.001$), reflecting improved overlap for larger lesions. Associations between IPH volume and $d_{H95}$ were weaker and model-dependent; while negative trends were observed, these did not consistently survive correction for multiple comparisons. Collectively, these results indicate that all models perform better on larger, more confluent hemorrhages, which are inherently easier to segment.}

\textcolor{black}{The volume-stratified (Figure~\ref{fig:anat_hemo}a) and anatomy-stratified (Figure~\ref{fig:anat_hemo}b) results were consistent with our statistical findings. Smaller IPH volumes are more challenging for both voxel-wise overlap metrics (\textit{e.g.}, F-measure and IoU) and boundary metrics ($d_{H95}$), which explains the larger variability observed in this subgroup. Moderate to large IPH volumes tend to yield greater overlap, which dampens the influence of boundary errors, consistent with the positive association between IPH volume and F-measure/IoU and the negative association with $d_{H95}$. Regarding anatomical context, hemorrhages adjacent to cortical sulci pose additional segmentation challenges, potentially due to partial-volume effects, whereas hemorrhages near the ventricles tend to exhibit more consistent contrast, resulting in more stable performance across models.
}

Our multi-site validation addresses a critical gap in previous segmentation studies, which were often limited to single-institution or publicly available datasets \cite{inkeaw2022automatic}. The consistent performance across sites suggests that the advanced architectures learn feature representations that are robust to site-specific variations in imaging protocols, making them suitable for broader clinical deployment. 

When contextualized within the broader literature, our multi-site results demonstrate competitive, if not superior, performance and generalizability. While \cite{inkeaw2022automatic} reported a median Dice coefficient of $0.37$ for IPH segmentation (within a multi-class ICH segmentation task) and \cite{lin2025attention} achieved Dice scores around $0.91$ for cerebral contusion segmentation, our FPN model achieved an F-measure of $0.868$ in the 2DCor orientation (the F-measure being equivalent to the Dice coefficient) while demonstrating robust multi-site performance. The architectural efficiency of models such as FPN, Attention U-Net, and Trans U-Net suggests a promising path toward solutions that are both highly accurate and computationally feasible for real-time use in emergency settings across multiple healthcare institutions \cite{piao2023transhardnet}.
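
The comparison with Dice-based results rests on the equivalence of the two overlap scores for binary masks. As a brief sketch, assuming the F-measure denotes the balanced $F_1$ score and writing $TP$, $FP$, and $FN$ for the voxel-wise true positives, false positives, and false negatives of a single case:
\begin{equation}
F_1 = \frac{2\,TP}{2\,TP + FP + FN} = \mathrm{Dice}, \qquad IoU = \frac{TP}{TP + FP + FN} = \frac{F_1}{2 - F_1}.
\end{equation}
The second identity holds case by case; because the mapping is nonlinear, means taken over cases need not satisfy it exactly.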

\textcolor{black}{In clinical practice, segmentation accuracy is more than a technical measure; it directly affects important decisions. Errors in IPH segmentation can impact how we estimate hematoma volume, assess midline shift, measure edema, and track changes over time. These steps are crucial for predicting outcomes, choosing treatments, and planning surgery. Even small errors at the boundaries can lead to overestimation or underestimation of hemorrhage volume, a key factor in predicting mortality and determining whether a patient qualifies for minimally invasive procedures. Missing small or irregular bleeds can delay diagnosis or hide early hematoma growth, while false positives can make the condition seem worse than it is and lead to unnecessary treatments. This shows why it is important to reduce boundary errors (measured by $d_{H95}$) and improve volumetric overlap (measured by F-measure and IoU), especially for small or changing hemorrhages, where treatment decisions are most affected by segmentation uncertainty. }


% This study aimed to develop and evaluate advanced deep learning models for the automated segmentation of intraparenchymal hemorrhage (IPH) from CT images. The primary findings demonstrate that both the Attention U-Net and Feature Pyramid Network (FPN) architectures significantly outperform the standard U-Net baseline across all performance metrics, including F-measure, Intersection over Union (IoU), and the 95th percentile Hausdorff Distance ($d_{H95}$). The statistical evidence strongly supports integrating attention mechanisms and multi-scale feature fusion for this clinically critical task.

% The superior performance of the Attention U-Net and FPN models can be attributed to their enhanced architectural designs, which directly address key challenges in IPH segmentation. The Attention U-Net incorporates attention gates that dynamically weight feature maps, allowing the model to focus computational resources on salient hemorrhagic regions while suppressing irrelevant background information. This is particularly beneficial for segmenting IPH, which often presents with irregular borders and heterogeneous intensities \cite{Lin2025Advanced}. Our results confirm this, with the Attention U-Net yielding significantly higher F-measure (estimate = $0.05370$, $p < 0.001$) and IoU (estimate = $0.07059$, $p < 0.001$) compared to the baseline. Similarly, the FPN architecture, by leveraging a feature pyramid, excels at capturing multi-scale contextual information. This enables the model to effectively recognize hemorrhagic lesions of varying sizes — from small punctate bleeds to large confluent hematomas — by combining high-resolution semantic features with rich spatial details from earlier layers. The FPN's performance, slightly edging out the Attention U-Net in reducing boundary error ($d_{H95}$: $-4.445$ vs. $-3.648$), underscores the advantage of explicit multi-scale feature representation for precise boundary delineation.

% A crucial and nuanced finding of our analysis is the significant interaction between segmentation style and ground truth hemorrhage volume (GT\_vol). While a larger GT\_vol was independently associated with better performance across all models (e.g., F-measure estimate = $1.863 \times 10^{-06}$, $p < 0.001$), the \emph{benefit} of using an advanced architecture was more pronounced for minor hemorrhages. This is evidenced by the significant negative interaction terms between style and GT\_vol for both F-measure and IoU (e.g., FPN: $-4.892 \times 10^{-07}$, $p < 0.001$). In practical terms, this implies that while a standard U-Net may perform adequately on large, obvious hematomas, the Attention U-Net and FPN offer a substantial advantage in the more challenging, clinically ambiguous cases involving smaller IPH volumes. This is a critical advancement, as early and accurate detection of minor hemorrhages can significantly influence patient management and outcome \cite{Inkeawy2022Automatic}. The superior performance on smaller lesions suggests that our advanced models are better at leveraging subtle textural and contextual cues that are often missed by simpler architectures.

% The post-hoc analyses further solidify the consistency of these findings. The pairwise comparisons revealed no statistically significant differences between the Attention U-Net and FPN on any metric ($p > 0.05$), indicating that both architectural advancements offer comparable, considerable performance gains over the baseline U-Net. The estimated marginal means provide a clear clinical interpretation of this improvement: the FPN model achieved an F-measure of $0.864$ and an IoU of $0.780$, compared to $0.813$ and $0.711$ for the standard U-Net, respectively. Perhaps more impactful is the improvement in boundary precision, where the $d_{H95}$ was nearly halved from $7.56$ mm for the U-Net to $3.61$ mm for the FPN. Accurate boundary definition is paramount for volume quantification, which is a key prognostic indicator in IPH \cite{roy2015intraparenchymal}.

% When contextualized within the broader literature, our results demonstrate competitive, if not superior, performance. For instance, \cite{Inkeawy2022Automatic} reported a median Dice coefficient of $0.37$ for IPH segmentation using a 3D DeepMedic model, while \cite{Lin2025Advanced} achieved a Dice score of $0.91$ for cerebral contusion (a form of IPH) using an Attention-based ResU-Net. Our FPN model's F-measure of $0.864$ (a Dice-equivalent metric) places it firmly within the higher performance range reported in recent studies. The work by \cite{piao2023transhardnet} emphasized the importance of inference speed for clinical deployment. While our current study focused on segmentation accuracy, the architectural efficiencies of models like FPN suggest a promising path toward developing solutions that are both highly accurate and computationally feasible for real-time use in emergency settings.

% Despite the promising results, certain limitations must be acknowledged. The negative interaction between model sophistication and hemorrhage volume, while highlighting a strength, also indicates that segmenting very small or nascent bleeds remains challenging even for advanced models. Future work could explore integrating progressive resolution training or focal loss to enhance sensitivity to minor hemorrhagic foci further. Furthermore, while our models were rigorously validated, external testing on multi-center, multi-scanner datasets is essential to confirm generalizability across diverse clinical environments and patient populations.



%%%%%%%%%old part 2

% In this study, we compared and evaluated different CNN variants for IPH segmentation, with a particular focus on generalizability across data from other sites. We evaluated three metrics (F-Measure, IoU, $d_{H95}$) because they are the most relevant in the literature. We conducted four independent linear regression analyses, followed by ANOVA comparisons, and found that the advanced models (Attention U-Net, FPN, and Trans U-Net) significantly outperformed the baseline U-Net across F-Measure and IoU metrics. For boundary error ($d_{H95}$), Attention U-Net and FPN showed significant improvement over baseline, while Trans U-Net did not demonstrate a statistically significant reduction. This consistent performance across different metrics indicates that advanced models tend to estimate the IPH volume more accurately. This pattern is also well-identified in Tables \ref{tab:fmeasure}-\ref{tab:hausdorff}. The 2DCor orientation achieved higher F-Measure and IoU scores, whereas 2.5D achieved the lowest $d_{H95}$, indicating better alignment of the IPH boundary with the ground truth. These patterns can also be highlighted in Figure \ref{fig:segmentation}, where there is an increasing number of FP and FN in 2DAxi, and fewer in 2DCor and 2.5D.

% \textcolor{black}{In clinical practice, segmentation accuracy is more than just a technical measure, it directly affect important decision. Errors in IPH segmentation can impact how we estimate hematoma volume, assess midline sift, measure edema, and track changes over time. These steps are crucial for predicting outcomes, choosing treatments, and planning surgery. Even small errors at the boundaries can lead to overestimation or underestimation of hemorrhage volume, a key factor in predicting mortality and determining whether a patient qualifies for minimally invasive procedures. Missing small or irregular bleeds can delay diagnosis or hide early hematoma growth, while false positives can make the condition seem worse than it is and lead to unnecessary treatments. This shows why it is important to reduce boundary errors (evaluated by $d_{H95}$) and improve volumetric overlap (evaluated by $F$-measure and $IoU$), especially for small or changing hemorrhages, where treatment decisions are most affected by segmentation uncertainty. }


% The Attention U-Net integrates attention blocks that adjust segmentations using high-resolution features from the encoding layers (via skip connections). On the other hand, FPN can capture multi-scale contextual information and identify texture patterns that the Attention U-Net sometimes misses. Trans U-Net, with its transformer-based architecture, is designed to capture long-range dependencies and may provide additional contextual information. In IPH segmentation, these types of models are suitable, as IPH often exhibits irregular borders and heterogeneous intensities \cite{lin2025attention}.

% In our multi-site comparison across CNN variants, the advanced models (Attention U-Net, FPN, and Trans U-Net) showed minimal significant differences in F-Measure and IoU, indicating generalizability even when protocols and scanner vendors differ. For boundary error ($d_{H95}$), Trans U-Net did not show a significant difference compared to baseline, while Attention U-Net and FPN did. This is of particular interest, as these models are often deployed in external sources where the protocol can vary drastically. On the other hand, the baseline U-Net showed minimal site preferences, which did not affect F-Measure scores.

% Although the models showed better performance with larger, more confluent lesions, there was a drastic reduction in $d_{H95}$ with Attention U-Net and FPN compared to baseline U-Net, along with better detection of smaller lesions. Although not statistically significant, Trans U-Net performed well in regards of $d_{H95}$ compared to baseline.  There is a consistent positive association between ground-truth IPH volume and segmentation performance across all models, as indicated by F-Measure ($p < 0.001$) and IoU ($p < 0.001$). While demonstrating significant negative associations with $d_{H95}$ across all styles (U-Net: $p = 0.030$; Attention U-Net: $p = 0.020$; FPN: $p = 0.052$; Trans U-Net: $p = 0.022$). This consistent pattern indicates that all models perform better on larger IPH, which are generally easier to segment. However, the advanced models (Attention U-Net, FPN, and Trans U-Net) showed higher F-Measure and IoU means when compared across the same patients.

% \textcolor{black}{The examination of the volume- (Figure \ref{fig:anat_hemo}.a and anatomic-stratified (Figure \ref{fig:anat_hemo}.b) results aligned with our statistical findings. Smaller IPH ($<$5mL) volumes is more challenging for voxel-wise overlap (\textit{e.g.}, F-Measure, and IoU) and boundary metrics ($d_{H95}$). Conversely, the analysis of moderate (5-30mL) and larger ($>$30mL) indicates This explain the larger variability in our metric evaluation  which explains the observed positive association between ground-truth IPH volume and F-Measure/IoU and the negative association with $d_{H95}$, and underscores why improvements in boundary precision (lower $d_{H95}$) are particularly valuable for small lesions where volumetric error has outsized clinical impact; likewise, anatomical context matters methodologically—lesions adjacent to cortical sulci pose additional segmentation difficulty due to partial-volume effects and complex cortical geometry, whereas hemorrhages abutting the ventricles tend to present more consistent contrast and therefore more stable performance across models—critically, the boundary-precision gains achieved by Attention U-Net and FPN persist across these clinically relevant volume and location strata, indicating that their advantages are not limited to large, easy-to-segment cases but translate into methodological robustness that can reduce clinically meaningful volumetric and boundary errors. 
% }


% The low $d_{H95}$ values in Attention U-Net and FPN demonstrated consistent boundary detection across IPH volumes, as evidenced by non-significant interaction terms ($p > 0.05$) and strong main effects. This indicates that the boundary-precision advantages of Attention U-Net, FPN, and Trans U-Net are maintained regardless of IPH burden, highlighting their robust generalization across varying pathology loads in multi-site applications. 

% Our post-hoc analyses confirmed our findings regarding multi-site consistency. The pairwise comparisons revealed no statistically significant differences among the advanced models (Attention U-Net, FPN, and Trans U-Net) for F-measure and IoU ($p \geq 0.684$). For $d_{H95}$, there were no significant differences among the advanced models ($p \geq 0.162$), but only Attention U-Net and FPN showed a significant improvement over the baseline U-Net ($p < 0.05$). This improvement was noticeable when evaluating $d_{H95}$, showing a reduction of 42\% and 50\% in IPH boundary error for Attention U-Net and FPN, respectively. Accurate boundary definition is key to differentiate healthy tissue from affected tissue, as well as being essential for volume calculation \cite{roy2015intraparenchymal}.

% Although the findings for site-specific analysis were promising, minimal site-specific effects were observed in our linear regression models. We observed isolated sites showed marginal effects (\textit{i.e.}, Site F for F-measure in baseline U-Net: $p = 0.0389$ and in Trans U-Net: $p = 0.0722$; Site N for $d_{H95}$ in U-Net: $p = 0.004$, FPN: $p = 0.004$, and Trans U-Net: $p = 0.001$). This indicates a potential room for improvement. Our multi-site validation addresses a critical gap in previous segmentation studies, which were often limited to single-institution or publicly available datasets \cite{inkeaw2022automatic}. The consistent performance across sites suggests that the advanced architectures learn feature representations that are robust to site-specific variations in imaging protocols, making them suitable for broader clinical deployment.

% When contextualized within the broader literature, our multi-site results demonstrate competitive, if not superior, performance and generalizability. While \cite{inkeaw2022automatic} reported a median Dice coefficient of $0.37$ for IPH segmentation (in multi ICH segmentation) and \cite{lin2025attention} achieved Dice scores around $0.91$ for cerebral contusion segmentation, our FPN model achieved an F-measure of $0.868$ (in Dice metric) while demonstrating robust multi-site performance. The CNN variant efficiencies of models like FPN, Attention U-Net, and Trans U-Net suggest a promising path toward developing solutions that are both highly accurate and computationally feasible for real-time use in emergency settings across multiple healthcare institutions \cite{piao2023transhardnet}.


%-------------------------------------+
   \section{Summary and Conclusions} %|
   \label{sec:conclusion}            %|
%-------------------------------------+

In this work, we investigated the use of CNN variants for IPH segmentation, in line with current findings on the best-performing techniques in the literature. Specifically, we used statistical models to identify the best-performing CNN variant \textcolor{black}{while accounting for lesion severity and multi-site data heterogeneity}.

Our findings demonstrate that Attention U-Net, FPN, and Trans U-Net significantly improve automated IPH segmentation relative to the baseline U-Net, as measured by F-measure and IoU. \textcolor{black}{In contrast, improvements in boundary definition were more selective. Only FPN achieved a statistically significant reduction in boundary error ($d_{H95}$) relative to the baseline U-Net after correction for multiple comparisons, whereas Attention U-Net and Trans U-Net did not show significant boundary improvements. Swin U-Net consistently underperformed Attention U-Net, Trans U-Net, and FPN on both overlap- and boundary-based metrics}. Collectively, these results indicate that architectural advances improve volumetric accuracy, but gains in boundary precision are not uniform across models.

\textcolor{black}{Performance improvements were most evident for larger, more confluent hemorrhages, while segmentation of small or irregular IPH remains challenging across all architectures. No significant effects of age, sex, or acquisition site were observed after correction for multiple comparisons, supporting the robustness and generalizability of the evaluated models across heterogeneous clinical imaging conditions in Canada. By enabling more accurate and reliable IPH segmentation across sites, these models have the potential to reduce reliance on labor-intensive manual delineation and streamline acute stroke workflows.}

Despite these promising results, several limitations remain. \textcolor{black}{Segmentation of very small or early-stage hemorrhages remains difficult and is a clear target for future work.} Additionally, while the present multi-site evaluation supports generalizability within a national context, \textcolor{black}{external validation on geographically and demographically distinct datasets will be essential to confirm broader clinical applicability.}

\bibliography{midl26_40}

\end{document}
