\section{Results}
\label{chap:results}

We evaluate PaSAL on three tasks: (i) artery-vein segmentation on HiPaS, (ii) anatomical labeling on PTL, and (iii) full-pipeline clinical viability on a longitudinal radiotherapy cohort. All per-case metrics, extended visualizations, and correlation plots are provided in Appendix~\ref{chap:plainresults}.

\subsection{Segmentation}

Segmentation performance on the HiPaS test set (13 scans) is summarized in Table~\ref{tab:segmentation_metrics}. Arteries outperform veins on every metric (higher Dice, sensitivity, and precision; lower HD95), although Dice is close to $90\%$ for both structures. We attribute the slightly lower venous performance to weaker venous skeleton priors and less reliable graph-root selection during hierarchy construction. Several lower-Dice outliers correspond to anatomically valid distal branches that are missing from the HiPaS ground truth rather than to true model errors.

Overall values appear lower than those reported by \citet{chu2025deep}, but a direct comparison is not meaningful: although the original scans are available, the full target annotations used in their study were not released, and the public annotations exclude distal vessels, which yields a different target definition. We therefore report segmentation metrics on the released Level-3 targets, which enable quantitative evaluation but do not capture performance on the full peripheral vascular tree. For the same reason, we do not include a separate segmentation baseline: a comparison restricted to these targets would reflect neither our primary objective of assessing the clinical usefulness of complete vascular tree predictions nor our secondary objective of analyzing the relationship between quantitative metrics and expert-perceived clinical quality.

\begin{table}[t]
    \centering
    \scriptsize
    \renewcommand{\arraystretch}{1.15}
    \caption{Segmentation performance on the HiPaS test set (13 scans). Values are mean (standard deviation).}
    \label{tab:segmentation_metrics}
    \begin{tabular}{lcc}
        \toprule
        \textbf{Metric} & \textbf{Artery} & \textbf{Vein} \\
        \midrule
        Dice (\%) & 90.0 (2.2) & 88.7 (2.5) \\
        Sensitivity (\%) & 91.9 (3.4) & 90.0 (3.3) \\
        Precision (\%) & 88.4 (4.4) & 87.8 (5.9) \\
        HD95 (mm) & 3.62 (3.42) & 6.77 (6.46) \\
        \bottomrule
    \end{tabular}
\end{table}
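
For reference, the overlap metrics in Table~\ref{tab:segmentation_metrics} can be read under the standard voxelwise definitions, which we assume here rather than restate from the evaluation code. With $TP$, $FP$, and $FN$ denoting the numbers of true-positive, false-positive, and false-negative voxels for a given vessel class,
\[
\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}.
\]
Under the same assumption, HD95 is the 95th percentile of the symmetric surface distances between the predicted surface $\partial P$ and the reference surface $\partial G$,
\[
\mathrm{HD95} = P_{95}\Bigl( \{ d(p, \partial G) : p \in \partial P \} \cup \{ d(g, \partial P) : g \in \partial G \} \Bigr),
\qquad d(x, S) = \min_{s \in S} \| x - s \|_2 ,
\]
so that lower values are better. Implementations differ in how the percentile is taken (for example, over the pooled distances as above or as the maximum of the two directed percentiles), so this form should be read as an illustrative convention.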

Expert evaluation of PaSAL's segmentation on nine HiPaS scans yielded high scores across all categories (Table~\ref{tab:qual_seg_summary}): mean ratings ranged from 3.6 to 4.0 on a 0--5 scale, indicating strong perceived accuracy, robustness, and practical clinical utility. Within this high-performance regime, no voxelwise metric correlated significantly with expert judgment (maximum Spearman correlation $|\rho|=0.52$ for vein Dice vs.\ vessel branch abundance, $p=0.15$; most $|\rho|<0.25$; see Appendix Table~\ref{tab:seg-hipas-corr} and Figs.~\ref{fig:segmentation_correlation_appendix}--\ref{fig:segm_scatter_vs_expert_appendix}). This suggests that, for segmentation methods already achieving Dice values near $90\%$, further improvements in overlap or distance metrics do not reliably reflect perceived clinical quality.

We note that PaSAL predicts vessels only up to Level-3, so both the quantitative metrics and the expert assessments are derived from these predictions. Some expert criteria, such as vessel branch abundance, also take into account distal branches that the model does not predict, which may partly explain the weak correlations observed.
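
As a rough plausibility check on the reported significance level, the usual $t$-approximation for a Spearman coefficient can be applied. This is a sketch under the assumption that $n = 9$ paired observations (one per rated scan) enter the correlation; the test actually used for the reported $p$-value may differ (for example, an exact permutation test). With
\[
t = \rho \sqrt{\frac{n-2}{1-\rho^{2}}}, \qquad \nu = n - 2,
\]
the strongest observed correlation, $|\rho| = 0.52$, gives
\[
t = 0.52 \sqrt{\frac{7}{1 - 0.52^{2}}} \approx 1.61, \qquad \nu = 7,
\]
which lies below the two-sided $5\%$ critical value $t_{0.975,7} \approx 2.36$ and is in line with the reported $p = 0.15$.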


\begin{table}[t]
\centering
\scriptsize
\renewcommand{\arraystretch}{1.15}
\caption{Expert assessment of artery/vein segmentation on nine HiPaS scans. Scores are mean (standard deviation) on a 0--5 scale.}
\label{tab:qual_seg_summary}
\begin{tabular}{lcc}
\toprule
\textbf{Category} & \textbf{Artery} & \textbf{Vein} \\
\midrule
Segmentation Accuracy and Robustness & 3.7 (0.7) & 3.6 (0.5) \\
Vessel Branch Abundance & 4.0 (0.5) & 3.9 (0.6) \\
Diagnostic Assistance & 4.0 (0.5) & 4.0 (0.7) \\
Mean Score & 3.9 (0.4) & 3.8 (0.5) \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Anatomical Labeling}

Labeling performance on the PTL test set (160 scans) is summarized in Table~\ref{tab:labeling_metrics}. Arteries achieve higher voxel-, node-, and edge-level Dice than veins, which likely reflects the greater geometric ambiguity and inter-patient variability of venous anatomy. To ensure a rigorous comparison, we re-evaluated the baseline implementation of \citet{xie2025efficient} using exactly the same dataset splits and evaluation protocol. PaSAL performs on par with the baseline on all metrics for both arteries and veins: the differences in mean Dice are at most 0.4 percentage points (node Dice, veins), well within one standard deviation, and inconsistent in direction. Since identical IPGN model weights were used and no retraining was performed, these results indicate that enforcing graph connectivity does not degrade labeling accuracy, while providing a structured representation that may be useful for downstream vascular analyses.


\begin{table}[t]
    \centering
    \scriptsize
    \renewcommand{\arraystretch}{1.15}
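    \caption{Anatomical labeling on the PTL test set (160 scans). Values are mean (standard deviation) Dice in \%. The Artery and Vein columns report PaSAL; Xie~(A) and Xie~(V) report the re-evaluated baseline for arteries and veins, using identical splits and metrics.}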
    \label{tab:labeling_metrics}
    \begin{tabular}{lcccc}
        \toprule
        \textbf{Metric} & \textbf{Artery} & \textbf{Vein} & \textbf{Xie (A)} & \textbf{Xie (V)} \\
        \midrule
        Voxel Dice & 89.6 (3.8) & 83.1 (3.2) & 89.4 (4.0) & 83.0 (3.1) \\
        Node Dice & 98.1 (2.1) & 94.9 (2.9) & 98.3 (2.1) & 95.3 (2.6) \\
        Edge Dice & 90.7 (5.6) & 78.9 (4.8) & 90.4 (5.9) & 79.0 (4.5) \\
        \bottomrule
    \end{tabular}
\end{table}

% Additional qualitative examples are provided in Appendix Fig.~\ref{fig:labeling_examples_appendix}.  
% Expert ratings on 21 scans (Appendix Fig.~\ref{fig:qualitative_labeling_appendix}) averaged 3.1--3.7, with highest scores attributed to clinical interpretability
% As with segmentation, correlations between Dice metrics and expert ratings were weak (Appendix Figs.~\ref{fig:labeling_correlation_appendix}--\ref{fig:label_scatter_voxel_vs_expert_appendix}), suggesting that voxelwise and graph-based metrics capture only part of the clinically relevant variation in labeling quality.

Expert assessment of labeling on 21 PTL scans (Table~\ref{tab:qual_label_summary}) yielded mean scores of 3.1--3.7 on a 0--5 scale, with the highest ratings for usefulness in clinical interpretation, suggesting that the labeled vascular trees are usable in clinical practice. Correlations between Dice-based metrics and expert scores were consistently weak and not significant (maximum $|\rho|=0.38$ for vein voxel Dice vs.\ label consistency, $p>0.05$; see Appendix Table~\ref{tab:ptl-corr} and Figs.~\ref{fig:labeling_correlation_appendix}--\ref{fig:label_scatter_voxel_vs_expert_appendix}). Taken together, these findings suggest that voxelwise and graph-based Dice capture only a limited portion of what experts consider clinically important. In this high-performance regime, pursuing higher metric scores alone is unlikely to translate into improved clinical utility or better expert-perceived quality, which underscores the need for evaluation criteria beyond conventional accuracy metrics.
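
To put the magnitude of these coefficients in context, one can ask how large $|\rho|$ would need to be to reach significance at this sample size. The following is an approximation under the assumption that $n = 21$ paired observations are used and that the $t$-approximation for Spearman's $\rho$ is adequate; it is not necessarily the test behind the reported $p$-values. Inverting $t = \rho\sqrt{(n-2)/(1-\rho^{2})}$ at the two-sided $5\%$ critical value $t_{0.975,19} \approx 2.09$ gives
\[
|\rho|_{\mathrm{crit}} = \frac{t_{0.975,\,n-2}}{\sqrt{t_{0.975,\,n-2}^{2} + n - 2}}
= \frac{2.09}{\sqrt{2.09^{2} + 19}} \approx 0.43 .
\]
The largest observed value, $|\rho| = 0.38$, falls below this threshold, consistent with the reported $p > 0.05$.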

\begin{table}[t]
\centering
\scriptsize
\renewcommand{\arraystretch}{1.15}
\caption{Expert assessment of anatomical labeling on 21 PTL scans. Scores are mean (standard deviation) on a 0--5 scale.}
\label{tab:qual_label_summary}
\begin{tabular}{lcc}
\toprule
\textbf{Category} & \textbf{Artery} & \textbf{Vein} \\
\midrule
Label Consistency Across Branches & 3.1 (0.8) & 3.1 (0.8) \\
Correctness of Proximal vs.\ Distal Labeling & 3.3 (0.6) & 3.2 (0.7) \\
Usefulness for Clinical Interpretation & 3.7 (0.5) & 3.6 (0.5) \\
Mean Score & 3.4 (0.5) & 3.3 (0.5) \\
\bottomrule
\end{tabular}
\end{table}

% \begin{figure}[t]
%     \centering
%     \includegraphics[width=\linewidth]{Figures/Results/arteries_veins_labeling_percent.png}
%     \caption{Micro-averaged anatomical labeling Dice on PTL (160 scans). Artery and vein performance is reported at voxel, node, and edge level.}
%     \label{fig:labeling_metrics}
% \end{figure}

\subsection{Clinical Viability}

We further evaluated full-pipeline performance, including label propagation to peripheral branches, on 63 longitudinal CT scans from 12 radiotherapy patients. Outputs remained anatomically coherent across timepoints, although substantial post-treatment deformation occasionally reduced local smoothness in distal branches. A representative baseline--follow-up pair is shown in Fig.~\ref{fig:visualization_pipeline_results}.

A clinical expert assigned average scores of 3.4--3.9 across anatomical completeness, labeling plausibility, and practical clinical utility (Table~\ref{tab:qual_clinical_summary}). No scan scored below~3, and lower ratings were primarily associated with coarse voxel spacing or treatment-induced anatomical shifts rather than systematic limitations of the pipeline. Per-scan ratings are provided in Appendix~\ref{chap:plainresults}.

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{Figures/Results/3 Pipeline result visualization.PNG}
    \caption{Pipeline predictions for a representative patient from the clinical cohort. Left: baseline scan; right: post-radiotherapy follow-up.}
    \label{fig:visualization_pipeline_results}
\end{figure}

\begin{table}[t]
\centering
\scriptsize
\renewcommand{\arraystretch}{1.15}
\caption{Clinical expert evaluation on 63 longitudinal scans from 12 patients. Scores are mean (standard deviation) on a 0--5 scale.}
\label{tab:qual_clinical_summary}
\begin{tabular}{lcc}
\toprule
\textbf{Category} & \textbf{Artery} & \textbf{Vein} \\
\midrule
Anatomical Completeness and Accuracy & 3.4 (0.7) & 3.5 (0.6) \\
Consistency and Plausibility of Labeling & 3.5 (0.6) & 3.5 (0.6) \\
Clinical Utility & 3.9 (0.3) & 3.9 (0.3) \\
Mean Score & 3.6 (0.5) & 3.6 (0.4) \\
\bottomrule
\end{tabular}
\end{table}
