\section{Methods}

\textit{Overview}: Consider an input volume $V$, with $V(x)$ the intensity value at voxel position $x\in \mathbb{R}^3$; a trained CNN $g(V(x); \theta)$ with parameters $\theta$; and a segmented volume $Y(x) = g(V(x); \theta)$ with $Y(x) \in \{0, 1\}$. Our objective is to refine the segmentation $Y$ using a graph convolutional neural network (GCN) trained on a graph representation of the input data. Our framework operates as a post-processing step (one volume at a time) and assumes that no information about the real segmentation (ground truth) is available.

We first look for a binary volume $U_b$ that highlights the potential false positive and false negative elements of $Y$; we define $U_b$ through uncertainty analysis. The second step uses $U_b$, together with information from $Y$, $g$, and $V$, to refine the segmentation $Y$. We solve this refinement problem using a semi-supervised GCN trained on a graph representation of our input volume.

\subsection{Uncertainty Analysis: Finding Incorrect Elements}\label{subsec:unc_analysis}

In our framework, incorrect elements are estimated from the confidence of $g$. We employ the Monte Carlo dropout (MCDO) approximation \cite{bib:Kendall17} to evaluate the uncertainty of the CNN. This strategy can be applied to any model trained with dropout layers, without modifying or retraining the model, which makes it well suited to a post-processing refinement algorithm. MCDO keeps the dropout layers of the network active at inference time and performs $T$ stochastic passes through the network to approximate the output of a Bayesian neural network. Following this method, we get the model's expectation
\begin{equation}
\mathbb{E}(x) \approx \frac{1}{T}\sum_{t=1}^{T} g(V(x); \theta_t),
\end{equation}
with $\theta_t$ the model parameters after applying dropout in pass $t$. The model uncertainty $\mathbb{U}$ is given by the entropy, computed as
\begin{equation}
\mathbb{U}(x) = H(x) = -\sum_{c=1}^M P^c(x)\log{P^c(x)},
\end{equation}
with $P^c(x)$ the probability that voxel $x$ belongs to class $c$, and $M$ the number of classes ($M=2$ in our binary segmentation scenario). We use $\mathbb{E}$ as an approximation of the probability volume $P$ when computing the entropy. Finally, we define the potentially incorrect elements by applying a binary threshold to the entropy volume
\begin{equation}
U_b(x) =
\begin{cases}
1 & \text{if $\mathbb{U}(x) > \tau$} \\
0 & \text{otherwise,}
\end{cases}
\end{equation}
where the uncertainty threshold $\tau$ sets the entropy level above which a voxel $x$ is considered uncertain.
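
As a worked illustration (all numbers here are hypothetical), suppose that each stochastic pass returns a foreground probability and that $T=4$ passes give $0.8$, $0.9$, $0.7$, and $1.0$ for a voxel $x$, so that $\mathbb{E}(x) = 0.85$. Assuming the natural logarithm in the entropy,
\begin{equation}
\mathbb{U}(x) = -\left(0.85\log 0.85 + 0.15\log 0.15\right) \approx 0.423,
\end{equation}
which lies below the maximum $\log 2 \approx 0.693$ reached at $\mathbb{E}(x)=0.5$. With an illustrative threshold $\tau = 0.3$, this voxel would be flagged as uncertain ($U_b(x)=1$), whereas a voxel with $\mathbb{E}(x)=0.99$ has $\mathbb{U}(x)\approx 0.056$ and would be kept as a confident prediction.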

\subsection{Graph Learning for Segmentation Refinement}\label{gcn_refinemet}

At this point, we have a binary mask $U_b$ indicating voxels with high uncertainty. The uncertainty analysis only tells us that the model is not confident about its predictions. Some of the elements indicated by $U_b$ may in fact be correct, and their values should not be changed.
However, we can use a learning model that trains on high confidence voxels to reclassify (refine) the output of the CNN $g$.
Using the information from the uncertainty analysis, we define a partially-labeled graph in which voxels are mapped to nodes and neighborhood relationships to edges. In this way, we formulate the refinement problem as a semi-supervised graph learning problem. We address this problem by training a GCN on the high confidence voxels using the methods presented in \cite{bib:Kipf17}. The rest of this section describes the formulation of our partially-labeled graph.

\subsubsection{Partially-Labeled Nodes}
Given a graph $\mathcal{G}$ representing our 3D volumetric data, at inference time we aim to obtain a refined segmentation $Y^*$ as the output of our GCN model $\Gamma$,
\begin{equation}
Y^* = \Gamma(\mathcal{G}(S);  \phi),
\end{equation}
where the graph $\mathcal{G}$ is constructed from the set of volumes $S=\{\mathbb{E}, \mathbb{U}, V, Y\}$ (see Section~\ref{subsec:unc_analysis} and Fig.~\ref{fig:gcn}) and $\phi$ represents the GCN's parameters.
\begin{figure}
\begin{center}
\includegraphics[width=0.8\textwidth]{figure1.jpg}    
\end{center}
\caption{a) The GCN refinement strategy. We construct a partially-labeled graph representation based on the uncertainty analysis of the CNN. Then, a GCN is trained to refine the segmentation. b) Connectivity. The black square is connected to its six perpendicular neighbors and to $k=16$ random voxels.} \label{fig:gcn}
\end{figure}

Since most of the voxels in the volume are irrelevant for the refinement process, and given that graphs are not restricted to a rectangular grid representation of the data, we define an ROI tailored to our target anatomy. We define our working region as $\hbox{ROI}(x) = \hbox{dilation}(U_b(x)) \cup \mathbb{E}_b(x)$, with $\mathbb{E}_b$ the expectation binarized with a threshold of $0.5$. Since the entropy is usually high in boundary regions, including the dilated $U_b$ ensures that the ROI is large enough to contain the organ. It also lets us include high confidence background predictions ($Y = 0$) for training the GCN. Including the expectation in the ROI gives us high confidence foreground predictions for training the model. This ROI reduces the number of nodes in the graph and, consequently, the memory requirements.
The voxels $x \in \hbox{ROI}$ define the nodes of $\mathcal{G}$. Each node is represented by a feature vector containing the intensity $V(x)$, the expectation $\mathbb{E}(x)$, and the entropy $\mathbb{U}(x)$. Finally, we label each node in the graph according to its uncertainty level using the following rule:
\begin{equation}
  l(x) =
  \begin{cases}
        Y(x) & \text{if $U_b(x)=0$} \\
  	    \text{unlabeled} & \text{if $U_b(x)=1$}
  \end{cases}
\end{equation}

\subsubsection{Edges and Weighting} \label{subsubsec:edges_weighting}

The most straightforward connectivity option is to connect each voxel with its adjacent voxels (6 or 26 neighbors). However, this simple nearest-neighbor scheme may not be adequate for our problem, for two reasons. First, with this scheme, every voxel is connected to its local neighborhood but lacks global information. Second, voxels with high uncertainty tend to form contiguous clusters. With a simple nearest-neighbor scheme, voxels inside these clusters are connected only to their adjacent neighbors, i.e., other uncertain voxels, with almost no connection to voxels with high confidence. Only the voxels on the boundary of these clusters are connected to high confidence voxels. Hence, the propagation of information from confident to uncertain regions is limited. A fully-connected graph can exploit the relationships between certain and uncertain regions at training and inference time, but at the cost of prohibitive memory requirements. In our work, we evaluate an intermediate solution. For a particular node (or voxel) $x$, we create connections with its six perpendicular immediate neighbors in the volume coordinate system. Additionally, we randomly select $k=16$ nodes in the graph and connect each of them to $x$. This defines a sparse representation that includes connections between labeled and unlabeled elements.
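
To give a rough sense of the savings (an illustrative count, assuming every node receives all $6+k$ connections; the node count below is hypothetical), a graph with $N$ nodes has at most
\begin{equation}
(6 + k)\,N = 22\,N
\end{equation}
edges for $k=16$, compared with $N(N-1)/2$ edges for a fully-connected graph. For $N = 10^5$ nodes, this gives at most $2.2\times 10^{6}$ edges instead of roughly $5\times 10^{9}$, a reduction of more than three orders of magnitude.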

To define the weights for the edges, we use a function based on Gaussian kernels that considers the intensity $V(x)$ and the 3-D position $x \in \mathbb{R}^3$ associated with each node:
\begin{equation}
\begin{split}
w(x_i, x_j) = \lambda \hbox{div}(x_i, x_j) + & \exp\left(-\frac{\|V(x_i) - V(x_j)\|^2}{2\sigma_1}\right) \\
& + \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_2}\right)
\end{split}
\end{equation}
where $\lambda$ is a balancing factor and $\hbox{div}(\cdot)$ is the diversity between the nodes \cite{bib:Zhouz17}, defined as $\hbox{div}(x_i, x_j) = \sum_{c=1}^M (P^c(x_i) - P^c(x_j))\log{\frac{P^c(x_i)}{P^c(x_j)}}$, with $M = 2$, $P^1(x_i) = \mathbb{E}(x_i)$, and $P^2(x_i) = 1 - P^1(x_i)$ for our binary case. We opt for an additive weighting instead of a multiplicative one because the GCN can take advantage of connections with both similar and dissimilar nodes during learning, whereas a multiplicative weighting could cut connections between dissimilar nodes. We found that the diversity can indirectly bring information about the similarity of two nodes in terms of class probability.
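
As a worked illustration of this weighting (all values are hypothetical, and the natural logarithm is assumed), note first that the diversity equals the symmetrized Kullback-Leibler divergence between the class distributions of the two nodes, so it is non-negative and vanishes only when both nodes have the same class probabilities. For $\mathbb{E}(x_i) = 0.9$ and $\mathbb{E}(x_j) = 0.6$,
\begin{equation}
\hbox{div}(x_i, x_j) = (0.9 - 0.6)\log\frac{0.9}{0.6} + (0.1 - 0.4)\log\frac{0.1}{0.4} \approx 0.538.
\end{equation}
Taking two adjacent voxels ($\|x_i - x_j\|^2 = 1$) with normalized intensities $V(x_i) = 0.5$ and $V(x_j) = 0.7$, and the illustrative settings $\lambda = 0.5$, $\sigma_1 = 0.1$, and $\sigma_2 = 1$, the edge weight is
\begin{equation}
w(x_i, x_j) \approx 0.5 \cdot 0.538 + \exp(-0.2) + \exp(-0.5) \approx 0.269 + 0.819 + 0.607 \approx 1.69.
\end{equation}
If $x_j$ were instead a random neighbor $20$ voxels away, the spatial term $\exp(-200)$ would be negligible, yet the edge would keep the weight contributed by the other two terms, whereas a product of the three terms would effectively remove the connection.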