\section{Evaluation}\label{sec:evaluation}

Our method predicts an acyclic vessel skeleton, with 3D coordinates assigned to each node, and edges oriented towards the root, forming a labeled directed acyclic graph (DAG). 
Comparing such a graph against its ground truth (GT) is challenging \cite{Drees2019gerome,Lyu2022reta}: no standard metric exists (see Suppl.\ \ref{sec:metrics_discussion}), and many measures lack an intuitive topological meaning or depend sensitively on node matching and sampling.
%
To compare predicted and GT graphs, we first resample both at a fixed step size $s>0$.
The next essential step is an \emph{assignment strategy} (cf.\ \citet{metrics_reloaded2024}) that matches nodes and edges between predicted and GT graphs. 
In Sec.\ \ref{subsec:hierarchical_matching} we propose a greedy hierarchical matching procedure designed for robust topological correspondence.
Based on these correspondences, we compute error metrics as described in Sec.\ \ref{sec:metrics_def}.

\subsection{Hierarchical Matching}
\label{subsec:hierarchical_matching}
Common approaches for assigning the nodes or edges of two graphs to each other are greedy nearest-neighbor matching and optimal matching based solely on spatial proximity, as in \citet{Drees2019gerome,trexplorer_super2025}.
While effective in simple scenarios, this strategy ignores the structural and semantic information inherent to tree-like graphs, making it unsuitable for capturing topological similarity, particularly where different vessels lie close together.
Therefore, we propose a greedy one-to-at-most-one hierarchical matching scheme that incorporates spatial, semantic, and ancestor information.
It is similar to the approach of \citet{Gillette2011DIAMEMetric}, but is also applicable in multi-tree scenarios.

A pseudo-code description of the matching procedure is provided in Suppl.~\algorithmref{alg:hierarchical-matching}. In short, we first determine the connected components of the GT graph $G$ and the predicted graph $P$.
Each node in $G$ and $P$ is assigned a semantic class---\textit{root}, \textit{branching point}, \textit{leaf}, or \textit{intermediate}. 
For every node in $G$, we identify candidate nearest neighbors in $P$ within a predefined distance threshold and rank them, first by semantic correspondence and second by spatial proximity.
%
We then iterate over the GT roots, each time selecting the root whose best candidate has the highest matching priority (identical semantic class first, minimal distance second).
Starting from each root, we perform \textit{two depth-first traversals}.
In the first, we visit branching and leaf nodes and assign each to the best available candidate, judged by the matching status of the candidate's parent, the candidate's semantic label, and its distance.
Thus, candidates whose parents are matched within the same GT tree receive the highest priority.
If no suitable candidate exists, the GT node remains unmatched.
The second traversal processes intermediate nodes using the same criteria.
After completing a GT tree, we proceed to the next. 
Importantly, the candidate lists are updated immediately whenever two nodes become matched to maintain consistency throughout the hierarchy. 
%
A quantitative comparison with greedy nearest-neighbor and Hungarian matching is presented in Suppl.\ \tableref{tab:matching_comparison}.
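
One way to read the ranking described above is as a lexicographic sort key; this is an illustrative sketch, not the exact definition used in the supplementary algorithm. The notation is introduced here only for illustration: $c(\cdot)$ denotes the semantic class of a node, $x_u$ and $x_v$ the node coordinates, $\tau$ the distance threshold, $\mathrm{pa}(u)$ the parent of $u$ in $P$, and $M_T$ the set of predicted nodes already matched within the current GT tree $T$. The Euclidean norm and the non-strict threshold are assumptions. For a GT node $v$, the candidate set is
\[
C(v) = \{\, u \in V_P : \|x_u - x_v\|_2 \le \tau \,\},
\]
and during the traversals the still-unmatched candidates would be sorted in ascending lexicographic order of
\[
k_v(u) = \bigl(\mathbf{1}[\mathrm{pa}(u) \notin M_T],\; \mathbf{1}[c(u) \neq c(v)],\; \|x_u - x_v\|_2\bigr),
\]
where $\mathbf{1}[\cdot]$ equals one if the condition holds and zero otherwise. Under this reading, the initial ranking and the root selection use only the last two components of $k_v(u)$.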

\subsection{Metrics Definitions}\label{sec:metrics_def}
Based on our literature review in Suppl.\ \sectionref{sec:metrics_discussion}, we report the \emph{edge-wise} F1 score, as used in \citet{Drees2019gerome,Drees2021voreenSkel}, since the F1 score is a widely established and, in our view, easily interpretable measure of topological correctness when applied to edges.
Yet F1 alone does not capture the structural impact of certain errors. For instance, a false positive edge connecting unrelated nodes can distort the topology far more than a shortcut to an ancestor (see \figureref{fig:fm_fs}).
To address this, we introduce \textit{false splits} and \textit{false merges} as additional topology-aware error measures, extending prior work \cite{matula2015tra,fisbe} to multi-tree graphs where errors may arise both within and across trees.
In addition, we report the metrics used in \citet{trexplorer_super2025} for comparability.
Although they also report F1 scores at the node and branch level, similar in spirit to our recommendation, their metrics rely on greedy one-to-one nearest-neighbor matching and are computed on individual nodes, and therefore do not fully capture connectivity.
Regarding graph-level Betti numbers, only Betti-0 (the number of connected components) is meaningful: since all graphs are assumed to consist of trees, Betti-1 (the number of cycles) is always zero.

\begin{figure}
    \centering
    \scriptsize
    \resizebox{1.0\linewidth}{!}{\includegraphics{figures/fm_and_fs.pdf}}
    \caption{\textbf{False Merges \& False Splits.}
    (a) A ground-truth skeleton next to three possible predictions.
    (b) The predicted graph has one FN and one FP edge. The FP is a false merge since it connects two nodes, neither of which is an ancestor of the other. Consequently, the FN is a false split.
    (c) The FP edge is \emph{not} a false merge since it keeps the ancestor relation w.r.t.\ its parent node intact. Consequently, the FN is \emph{not} a false split.
    (d) The predicted graph only has FN edges which are \emph{not} false splits since they do not change the node ancestor relations.
    \label{fig:fm_fs}}
\end{figure}

\paragraph{Edge-wise F1 Score.}
In the following, we denote by $G$ and $P$ the GT and predicted graphs with node sets $V_G$ and $V_P$, and by $\Phi: V_G \rightarrow V_P$ the one-to-one node matching.
Both graphs are resampled to a fixed step size ($s=1$ voxel, unless stated otherwise), after which we apply our proposed hierarchical matching.
%
The edge-wise F1 score is the harmonic mean of precision and recall. It relates the number of GT edges that are correctly recovered (true positives, TP) to the number of spurious predicted edges (false positives, FP) and missed GT edges (false negatives, FN), where TP, FP, and FN are defined as follows.
\begin{itemize}
    \item An edge \( (v,v') \) in \( G \) is a TP if and only if $(\Phi(v), \Phi(v'))$ is an edge in $P$.
    \item An edge \( (v,v') \) in \( G \) is an FN if and only if $v$ and $v'$ are matched and $(\Phi(v), \Phi(v'))$ is \textit{not} an edge in $P$.
    \item An edge \( (\Phi(v),\Phi(v')) \) in \( P \) is an FP if and only if $(v,v')$ is \textit{not} an edge in $G$.
\end{itemize}

The \emph{edge-wise F1, precision and recall} are defined as
\[
F_1^{\text{edge}} := \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}, \quad
\text{Precision}^{\text{edge}} := \frac{\text{TP}}{\text{TP}+\text{FP}}, \quad
\text{Recall}^{\text{edge}} := \frac{\text{TP}}{\text{TP}+\text{FN}}.
\]
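
As an illustration, consider a hypothetical toy example that is not taken from our experiments. Let $G$ have nodes $v_1,\dots,v_4$ with root $v_1$ and edges $(v_2,v_1)$, $(v_3,v_2)$, $(v_4,v_2)$. Assume all four nodes are matched and $P$ contains the edges $(\Phi(v_2),\Phi(v_1))$, $(\Phi(v_3),\Phi(v_2))$, and $(\Phi(v_4),\Phi(v_1))$. Then $(v_2,v_1)$ and $(v_3,v_2)$ are TPs, $(v_4,v_2)$ is an FN, and $(\Phi(v_4),\Phi(v_1))$ is an FP because $(v_4,v_1)$ is not an edge in $G$. With $\text{TP}=2$, $\text{FP}=1$, and $\text{FN}=1$, we obtain
\[
F_1^{\text{edge}} = \frac{2\cdot 2}{2\cdot 2 + 1 + 1} = \frac{2}{3}, \qquad
\frac{\text{TP}}{\text{TP}+\text{FP}} = \frac{2}{3}, \qquad
\frac{\text{TP}}{\text{TP}+\text{FN}} = \frac{2}{3},
\]
that is, edge-wise precision and recall are both $2/3$ in this example.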

\paragraph{False Merges (FM) and False Splits (FS).}
We further define false merges and false splits as follows.
A false merge is an FP edge \( (\Phi(v), \Phi(v')) \) in $P$ where the GT nodes \( v \) and \( v' \) in $G$ have no directed path between them. In other words, neither is an ancestor of the other.
For a false split, we consider the subgraph \( P^* \) of \( P \) obtained by excluding all false merge edges. A false split is an FN edge \( (v,v') \) in \( G \) such that adding the missing edge \( (\Phi(v), \Phi(v')) \) to \( P^* \) merges two connected components into one. That is, FS correspond to missing edges in \( P \) that cause false disconnections in \( P^* \).
The number of FS can be determined by
$\beta_0(P^*) - \beta_0(G)$, since each FS increases the number of connected components of $P^*$. Here, $\beta_0$ is the number of connected components (Betti-0).
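
To make these definitions concrete, consider a hypothetical toy example in which all nodes are assumed to be matched. Let $G$ be a single tree with root $v_1$ and edges $(v_2,v_1)$, $(v_3,v_1)$, $(v_4,v_2)$, so that $\beta_0(G)=1$. Suppose $P$ contains the edges $(\Phi(v_2),\Phi(v_1))$, $(\Phi(v_3),\Phi(v_1))$, and $(\Phi(v_4),\Phi(v_3))$. The edge $(\Phi(v_4),\Phi(v_3))$ is an FP, and since neither $v_3$ nor $v_4$ is an ancestor of the other in $G$, it is a false merge. Removing it yields $P^*$ with the two components $\{\Phi(v_1),\Phi(v_2),\Phi(v_3)\}$ and $\{\Phi(v_4)\}$, hence
\[
\beta_0(P^*) - \beta_0(G) = 2 - 1 = 1,
\]
which corresponds to the single false split $(v_4,v_2)$: adding $(\Phi(v_4),\Phi(v_2))$ to $P^*$ reconnects the two components. If $P$ instead contained the shortcut $(\Phi(v_4),\Phi(v_1))$, this FP would not be a false merge because $v_1$ is an ancestor of $v_4$. In that case $P^*=P$ is connected and $\beta_0(P^*) - \beta_0(G) = 0$, so the FN edge $(v_4,v_2)$ would not count as a false split.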

\paragraph{Trexplorer-super Evaluation for Comparability.}
To compute the metrics from \citet{trexplorer_super2025}, both $G$ and $P$ are resampled to a spacing of one voxel.
At the node level, precision, recall, and F1 are reported, along with radius accuracy measured via the mean absolute error (MAE).
At the branch level, the F1 score is reported, where a branch is considered a TP if at least 80\% of its nodes are matched.
