\section{Experiments and Results}
We validate our method by refining the output of a 2D CNN on the tasks of pancreas and spleen segmentation. We compare this approach with the refinement obtained from a conditional random field (CRF) method \cite{bib:Krahenbuhl11}. Then, we evaluate the effect of different uncertainty thresholds $\tau$ on our refinement method. We also investigate how the number of training examples used to train the base CNN affects our refinement strategy. Finally, we analyze the relationship between the main components used to construct the graph and the resulting refined segmentation. We make our code publicly available for reproducibility purposes\footnote[1]{https://github.com/rodsom22/gcn\_refinement}.

\subsection{Datasets}
We tested our framework on two CT datasets, one for pancreas and one for spleen segmentation. For the pancreas segmentation problem, we used the NIH pancreas dataset\footnote[2]{https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT} \cite{bib:Roth16,bib:Roth15,bib:Clark13}. We randomly selected 45 volumes of the NIH dataset for training the CNN model and reserved 20 volumes for evaluating the uncertainty-based GCN refinement.
For spleen, we employed the spleen segmentation task of the medical segmentation decathlon \cite{bib:simpson19} (MSD-spleen\footnote[3]{http://medicaldecathlon.com/}). For this problem, we trained the  CNN on 26 volumes and reserved 9 volumes to test our framework. The MSD-spleen dataset contains more than one foreground label in the segmentation mask. We unified the non-background labels of the  MSD-spleen dataset into a single foreground class since we evaluate our method for refining a binary segmentation model. 

\subsection{Implementation Details}
\subsubsection{CNN Baseline Model}
We chose a 2D U-Net as our CNN model \cite{bib:Ronneberger15}. We included dropout layers at the end of every convolutional block of the U-Net, as required by the MCDO method. The U-Net was trained as a binary segmentation model. Since we employ a 2D model, we trained it on axial slices. At inference time, we predicted every slice separately and then stacked all the predictions to obtain a volumetric segmentation (a similar strategy was used to perform the uncertainty analysis). As a post-processing step, we keep only the largest connected component of the prediction to reduce the number of false positives. Note that the U-Net serves only as a test case and other architectures can be used instead, because our refinement method relies on the model-independent MCDO analysis.

\subsubsection{Uncertainty Analysis and GCN Implementation Details} 
We utilized MCDO to compute the expectation and entropy using a dropout rate of $0.3$ and a total of $T=20$ stochastic passes. To obtain volumetric uncertainty from a 2D model, we performed the uncertainty analysis on every individual slice of the input volume and then we stacked all the results together to obtain the volumetric expectation and entropy.  We tested different values for the uncertainty threshold $\tau$ (see section \ref{subsec:refinement-experimetns}).
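
For reference, the two quantities can be sketched as follows, assuming the usual MCDO estimators for a binary model; the symbols $p_t$, $E$, and $H$ are notation introduced here for illustration only. Let $p_t(x)$ be the foreground probability predicted for voxel $x$ in the $t$-th stochastic pass. Then
\[
E(x) = \frac{1}{T}\sum_{t=1}^{T} p_t(x), \qquad
H(x) = -E(x)\log E(x) - \bigl(1 - E(x)\bigr)\log\bigl(1 - E(x)\bigr),
\]
with $T=20$ in our setting. If the logarithm is taken in base 2 (an assumption), $H(x)$ lies in $[0,1]$, which is the range covered by the thresholds $\tau$ that we explore.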

The GCN model is a two-layer network with 32 feature maps in the hidden layer and a single output neuron for binary node classification. The GCN is trained for 200 epochs with a learning rate of $10^{-2}$, binary cross-entropy loss, and the Adam optimizer. We kept these settings for the refinement of both segmentation tasks. After the refinement process, we can either replace only the uncertain voxels with the GCN prediction or replace the entire CNN prediction with the GCN output. We use the second approach since we found that it produces better results.
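
The architecture described above can be sketched as follows, assuming the standard first-order graph convolution with a normalized adjacency matrix; the exact propagation rule and all symbols below are assumptions made for illustration. For a graph with $N$ nodes and $F$ input features per node,
\[
Z = \sigma\!\left(\hat{A}\,\mathrm{ReLU}\!\left(\hat{A}\, X\, W^{(0)}\right) W^{(1)}\right),
\]
where $X$ is the $N \times F$ node feature matrix, $\hat{A}$ is the $N \times N$ normalized (weighted) adjacency matrix, $W^{(0)}$ is an $F \times 32$ weight matrix, $W^{(1)}$ is a $32 \times 1$ weight matrix, and $\sigma$ is the sigmoid function, so that $Z$ holds one foreground probability per node. Under the same assumptions, the training loss is evaluated only on the set $V_c$ of high-confidence nodes, with labels $y_i \in \{0,1\}$ taken from the CNN prediction:
\[
\mathcal{L} = -\frac{1}{|V_c|}\sum_{i \in V_c}\Bigl[\, y_i \log Z_i + (1 - y_i)\log(1 - Z_i) \Bigr].
\]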

\subsection{Comparison with State of the Art and Baseline CNN}

We applied our refinement method independently to every sample of the 20 NIH and 9 MSD-spleen testing volumes. Since CRF is a common refinement strategy, we use the publicly available implementation of the method presented in \cite{bib:Krahenbuhl11} to refine the CNN prediction. This CRF method assumes dense connectivity. Similar to \cite{bib:Krahenbuhl11}, we set one unary and two pairwise potentials. We use the prediction of the CNN as the unary potential. The first pairwise potential depends only on the position of the voxel in the 3D volume. The second pairwise potential combines the intensity and the position of the voxels. For the CRF refinement, we considered the same ROI used by the GCN.
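
As a sketch of this setup, the fully connected CRF of \cite{bib:Krahenbuhl11} minimizes an energy with one unary and one pairwise term over all voxel pairs. The notation below is illustrative, and the ordering of the two kernels follows the description above rather than the original publication:
\[
E(\mathbf{y}) = \sum_{i} \psi_u(y_i) + \sum_{i<j} \psi_p(y_i, y_j),
\]
\[
\psi_p(y_i, y_j) = \mu(y_i, y_j)\left[ w_1 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\theta_\gamma^2}\right) + w_2 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\theta_\alpha^2} - \frac{(I_i - I_j)^2}{2\theta_\beta^2}\right) \right].
\]
Here $y_i$ is the label of voxel $i$, $p_i$ its 3D position, $I_i$ its intensity, $\mu$ a label compatibility function, $w_1$ and $w_2$ the kernel weights, and $\theta_\alpha$, $\theta_\beta$, $\theta_\gamma$ the kernel widths. The first kernel uses position only and the second combines position and intensity.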
\begin{table}
			\caption{Average dice score performance (\%) of the GCN refinement compared with the CNN prediction and a CRF-based refinement of the CNN prediction. Results for pancreas and spleen are presented.} 
			\label{tab:comparison}
			\centering
			\begin{tabular}{l | c | c | c }
			\hline 
			Task  & CNN & CRF & GCN    \\ 
			      & 2D U-Net & refinement & Refinement (ours)  \\ 
			\hline
			Pancreas & $76.9 \pm 6.6$ & $77.2 \pm 6.5$ & $\mathbf{77.8 \pm 6.3}$ \\ 
			\hline
			\hline
			Spleen & $93.2 \pm 2.5$ & $93.4 \pm 2.6$ & $\mathbf{95.1 \pm 1.3}$ \\  
			\hline
			\end{tabular}
\end{table}

Results are presented in Table \ref{tab:comparison}. In the pancreas segmentation task, the GCN-based refinement outperforms the base CNN model and the CRF refinement by 0.9 and 0.6 percentage points, respectively. For spleen segmentation, our GCN refinement increases the dice score by 1.9 percentage points with respect to the base CNN and by 1.7 with respect to the CRF refinement. Figs. \ref{fig:results_pancreas} and \ref{fig:results_spleen} show visual examples of the GCN refinement compared with the base CNN prediction.
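
For reference, the dice score (DSC) is the standard overlap measure between a binary prediction and the ground truth. We assume here that it is computed per volume and then averaged. In terms of the voxel counts of true positives (TP), false positives (FP), and false negatives (FN),
\[
\mathrm{DSC} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} .
\]
As a purely illustrative example with hypothetical counts, $\mathrm{TP}=900$, $\mathrm{FP}=100$, and $\mathrm{FN}=100$ give $\mathrm{DSC} = 1800/2000 = 0.90$. Removing half of the false positives ($\mathrm{FP}=50$) raises the score to $1800/1950 \approx 0.923$.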

\begin{figure}
\begin{center}
\includegraphics[width=0.92\textwidth]{figure2.jpg}
\end{center}
\caption{Comparison of the CNN prediction and its corresponding GCN refinement for pancreas segmentation. Green indicates true positive (TP), red false positive (FP), and white false negative (FN) regions. From left to right: the first column shows an FP region removed and an FN region recovered after the refinement. The second and third columns show FP regions removed. The fourth column shows an FN region recovered but also a new FP region generated.} \label{fig:results_pancreas}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.92\textwidth]{figure3.jpg}
\end{center}
\caption{Comparison of the CNN prediction and its corresponding GCN refinement for spleen segmentation. Green indicates true positive (TP), red false positive (FP), and white false negative (FN) regions. From left to right: the first, second, and third columns show FN regions recovered. The fourth column shows an FN region recovered but also a new FP region generated.} \label{fig:results_spleen}
\end{figure}

\subsection{Influence of the Number of Training Samples} \label{subsec:refinement-numsamples}
We also evaluate the performance of the GCN refinement when the base CNN is trained with a small number of samples. For this, we randomly selected 10 of the 45 training volumes of the NIH dataset. For spleen, we selected nine of the 26 training volumes. Results are presented in Table \ref{tab:reduce_training}.
\begin{table}
			\caption{Average dice score performance (\%) of the GCN refinement compared with the CNN prediction and a CRF-based refinement of the CNN prediction. The CNN model was trained with 10 samples for the pancreas and 9 for the spleen.} 
			\label{tab:reduce_training}
			\centering
			\begin{tabular}{l | c | c | c }
			\hline 
			Task  & CNN & CRF & GCN    \\ 
			      & 2D U-Net & refinement & Refinement (ours)  \\ 
			\hline
			Pancreas-10 & $52.10 \pm 22.61$ & $52.20 \pm 22.62$ & $\mathbf{54.50 \pm 22.15}$ \\ 
			\hline
			Spleen-9 & $78.80 \pm 28.40$ & $78.80 \pm 28.40$ & $\mathbf{81.15 \pm 28.90}$ \\ 
			\hline
			\end{tabular}
\end{table}
Note the increase in the standard deviation for all the models. A possible reason is that the CNN does not generalize adequately to the testing set, due to the small number of training examples. As in the previous results, the GCN refinement improves the dice score over the base CNN, here by about 2.4 percentage points for the pancreas and 2.35 for the spleen.

\subsection{Influence of Uncertainty Threshold} \label{subsec:refinement-experimetns}
In our experiments, we evaluate the influence of different values of $\tau$. We tested the method with $\tau \in \{0.001, 0.3, 0.5, 0.8, 0.999\}$. In this way, we covered a wide range of conditions that define a voxel as ``uncertain''. After training the GCN, we replaced the entire CNN prediction with the GCN output. Table \ref{tab:cnn_vs_gcnref} reports the GCN refinement at different values of $\tau$ for the pancreas and spleen segmentation tasks; the corresponding CNN scores are listed in Tables \ref{tab:comparison} and \ref{tab:reduce_training}.

\begin{table}
			\caption{Average dice score performance (\%) of the GCN refinement at different uncertainty thresholds $\tau$. Pancreas-10 and Spleen-9 indicate the models trained with 10 and 9 samples, respectively.} 
			\label{tab:cnn_vs_gcnref}
			\centering
			\begin{tabular}{l | c | c | c | c | c }
			\hline 
			Task  &  GCN & GCN & GCN & GCN & GCN  \\ 
			      & $\tau = 0.001$ & $\tau = 0.3$ & $\tau = 0.5$ & $\tau = 0.8 $ & $\tau = 0.999$  \\ 
			\hline
			Pancreas & $77.71 \pm 6.3$ & $77.79 \pm 6.4$ & $77.77 \pm 6.3$ & $77.81 \pm 6.3$ & $77.79 \pm 6.3$ \\ 
			\hline
			Pancreas-10 & $54.55 \pm 22.1$ & $54.32 \pm 22.1$ & $54.15 \pm 22.2$ & $53.91 \pm 22.4$ & $53.14 \pm 22.9$ \\ 
			\hline
			\hline
			Spleen & $95.01 \pm 1.5$ & $94.92 \pm 1.4$ & $94.98 \pm 1.4$ & $94.97 \pm 1.4$ & $95.07 \pm 1.3$ \\  
			\hline
			Spleen-9 & $80.91 \pm 28.8$ & $80.94 \pm 28.9$ & $80.94 \pm 28.8$ & $80.98 \pm 28.9$ &  $81.15 \pm 28.9$\\ 
			\hline
			\end{tabular}
\end{table}

The parameter $\tau$ sets the minimum uncertainty required to consider a voxel as uncertain. Lower values lead to a higher number of uncertain elements. This directly determines the number of high-certainty nodes in the graph representation and hence the number of training examples for the GCN. It also influences the quality of the training voxels for the GCN, since a high threshold tolerates more uncertainty in the voxels whose CNN prediction is taken as reliable.
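
This behaviour can be made explicit with a short sketch, assuming that a voxel is marked as uncertain when its entropy exceeds the threshold; the symbols $H$ and $U_\tau$ are notation introduced here for illustration. With $H(x)$ denoting the entropy of voxel $x$, the set of uncertain voxels is
\[
U_\tau = \{\, x : H(x) > \tau \,\}, \qquad \tau_1 < \tau_2 \;\Rightarrow\; U_{\tau_2} \subseteq U_{\tau_1}.
\]
Hence, raising $\tau$ can only move voxels from the uncertain set to the high-certainty set that provides the training labels for the GCN, and never in the opposite direction.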

However, the results of Table \ref{tab:cnn_vs_gcnref} show that, except for pancreas-10 and spleen-9, the choice of this parameter has no significant impact. One reason may be that there is a clear separation between high- and low-uncertainty points. Therefore, changing $\tau$ may add (or remove) only a small number of nodes, which are insignificant for the learning process of the GCN.

For the pancreas-10 model, we notice a progressive decrease in the dice score as $\tau$ increases. Since this model uses fewer training examples, it is expected to have low confidence in its predictions (in contrast with the model trained with 45 volumes). In this scenario, a higher uncertainty threshold increases the chance of including high-uncertainty nodes as ground truth for training the GCN. A lower $\tau$ includes fewer points but with higher confidence. This appears to be beneficial for the pancreas segmentation model trained with fewer examples.

The opposite occurs with spleen-9, where higher values of $\tau$ are beneficial. This might indicate a dependency on the characteristics of the anatomy, since the pancreas presents more inter-patient variability.

In general, our results suggest that the parameter $\tau$ should be selected based on the target anatomy. Further, $\tau$ appears to have more influence in conditions of high uncertainty, e.g., when the model is trained with fewer examples. In the cases where $\tau$ has no significant impact, intermediate values are preferred, since they lead to a lower number of nodes and consequently to lower memory requirements.

\subsection{Deep Insights on Prediction, Expectation, and Entropy}\label{subsec:discussion}

We employed three elements from the uncertainty analysis in the definition of our graph: the CNN's prediction, the CNN's expectation, and the CNN's entropy. Fig. \ref{fig:graph_components} shows an example of these components.

\begin{figure}
\begin{center}
\includegraphics[width=0.92\textwidth]{figure4.png}
\end{center}
\caption{Elements used in the graph definition. In the CNN and GCN outputs: Green colors indicate true positives, red false positives, and white false negative regions. For the expectation and entropy: brighter intensities indicate higher values.} \label{fig:graph_components}
\end{figure}

The labels of the graph are given by the CNN's high-confidence prediction. However, Fig. \ref{fig:graph_components} shows that the refinement is similar to the expectation. The expectation is one of the features of the nodes. It is also the main component of the diversity in the edge weighting function (see section \ref{subsubsec:edges_weighting}). The GCN can learn how to use the CNN's expectation, together with intensity and spatial information, to reclassify the nodes of the graph. However, it can also generate false positives if the expectation contains artifacts. Fig. \ref{fig:graph_components} shows an example of this case, where a region in the expectation does not agree with the ground truth. Note also that the GCN reduced this region, which may be a result of the random long-range connections included in the graph definition.

In our last experiment, we evaluate the relationship between the expectation and the GCN refinement. For this, we compute the relative improvement of the GCN over the expectation. First, the expectation was thresholded at 0.5. Then, we computed its dice score with respect to the ground truth. The relative improvement is computed as:
\begin{equation}
rel\_imp = \frac{gcn_{dsc} - expectation_{dsc}}{expectation_{dsc}} \times 100.
\end{equation}
We compute $rel\_imp$ for every input volume. Fig. \ref{fig:comparison_expectation} shows the results for the pancreas segmentation task, comparing the metric when the expectation was obtained from a model trained with 45 (Fig. \ref{fig:comparison_expectation}a) and 10 samples (Fig. \ref{fig:comparison_expectation}b).
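
As a worked example with hypothetical values (not taken from our experiments), a volume with $expectation_{dsc} = 75.0$ and $gcn_{dsc} = 78.0$ gives
\[
rel\_imp = \frac{78.0 - 75.0}{75.0} \times 100 = 4.0,
\]
while $gcn_{dsc} = 72.0$ for the same expectation gives $rel\_imp = -4.0$. A value of zero means that the GCN refinement and the thresholded expectation reach the same dice score.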

\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{figure5.jpg}
\end{center}
\caption{Relative improvement (\%) per input volume of different GCN configurations with respect to the expectation of a pancreas segmentation model. The red line indicates the same dice score as the expectation. a) CNN trained with 45 volumes, $\tau=0.8$. b) CNN trained with 10 volumes, $\tau=0.001$.} \label{fig:comparison_expectation}
\end{figure}
Fig. \ref{fig:comparison_expectation}a shows that most of the dice scores (17/20) of the GCN refinement are either below or close to those of the expectation, while three volumes show an improvement over the expectation. This is different in Fig. \ref{fig:comparison_expectation}b. Here, 13 of the 20 volumes show either better or similar dice scores for the GCN compared to the expectation. A possible explanation is that, for models trained with an adequate number of examples (volumes), the expectation is already good enough. In contrast, models trained with few examples (volumes) have higher uncertainties, yielding unreliable expectations. Our results suggest that our GCN refinement strategy is preferable to the expectation or uncertainty analysis in such scenarios.
