\section{Introduction}
Volumetric image segmentation is of great importance in many computer-aided medical applications, including auxiliary diagnosis and follow-up treatment.
Recently, deep learning-based approaches~\cite{milletari2016v, isensee2019automated} have achieved remarkable performance on semantic object segmentation in 3D medical images. However, these supervised methods typically require a large quantity of images with pixelwise annotations, which are expensive and time-consuming to collect. % and require domain expertise.
To mitigate this problem, weakly supervised methods have been explored~\cite{rajchl2016deepcut,kervadec2019constrained,kervadec2020bounding}, which typically convert the volumetric segmentation into a series of 2D segmentation tasks and use only box or scribble annotations to train a segmentation network for the 2D tasks.

%in the semantic segmentation literature. In particular, a wide range of weak annotation schemes have been proposed for learning segmentation of 2D natural images, including point~\cite{bearman2016s}, scribble~\cite{lin2016scribblesup}, box~\cite{dai2015boxsup}, or even image-level labels~\cite{pinheiro2015image}. 
%\cite{rajchl2016deepcut} included DenseCRF in training to iteratively update segmentation network, while  designed specific regularization loss utilizing box or point labels. 

Despite their promising results, those 2D-based approaches suffer from several limitations when applied to volumetric images (e.g., CT or MRI). Firstly, they simply stack the generated 2D masks together as the final output, and thus tend to produce inaccurate object shapes~\cite{kervadec2019constrained,kervadec2020bounding}. In addition, they ignore the continuity of volumetric data in 3D space and are unable to exploit label correlation between consecutive 2D slices. %, which often leads to additional cost from redundant mask annotations. 
Due to these limitations, 2D methods tend to perform worse than their 3D counterparts given sufficient volumetric data with small inter-slice spacing~\cite{baumgartner2017exploration}. Furthermore, most weak annotations have a restrictive form~\cite{bearman2016s,lin2016scribblesup,dai2015boxsup,pinheiro2015image} and provide poor guidance for learning object shape priors in medical images. A more detailed discussion of related work is presented in Appendix~\ref{sec:related}.

In this paper, we propose a novel weakly supervised learning strategy for volumetric object segmentation to tackle the aforementioned limitations. Our main idea consists of two aspects: 
First, we propose a self-taught learning method to capture the 3D shape prior of a target object class based on object mask augmentation. 
%In particular, we observe that empirically a segmentation network trained with weak annotations can generate good 3D mask supervision, 
%which allows us to learn a shape denoising module for segmentation mask refinement. 
We then incorporate this learned shape prior into the training process of a shape-aware segmentation network. 
Second, we adopt a sparse annotation scheme to better exploit the spatial continuity of object masks and to facilitate learning the shape context without increasing the overall labeling cost.

%Subsequently, based on our empirical observation, our weak labels can supply sufficient information to learn an initial segmentation model, which have above-average generalization to some instances. Based on this initialization, our model can learn a self-taught shape prior, which can further apply to shape denoising and refinement. Moreover, our segmentation learning and shape denoising can be further improved in an iterative manner.

%To tackle the volumetric segmentation task, 
To achieve this, we design a deep neural network consisting of two main modules: a Semantic Segmentation Network (SSN), which produces an initial 3D segmentation mask from the input image, and a Shape Denoising Network (SDN), which then refines the initial mask and outputs a final volumetric segmentation. To train the deep network, we first introduce a sparse weak annotation scheme, in which we annotate a specific subset of 2D image slices and design a hybrid label that integrates a foreground scribble and a loose bounding box of the target object. %Our annotation scheme allows us to reduce labeling cost and yet supply sufficient foreground and background label information. 
Given the weak labels, we then develop an iterative learning framework for our network model that alternates between pixelwise label generation and network parameter update. 

Specifically, we first initialize our network model by training the segmentation module (i.e., SSN) using the weak labels, which generates initial segmentation masks for the training data. We then utilize the initial masks to learn the shape refinement module (i.e., SDN) with a self-taught method. To that end, we choose the mask prediction with the highest confidence as the target shape, and train the SDN module as a denoising autoencoder for the object masks~\cite{vincent2010stacked, Sundermeyer_2018_ECCV,oktay2017anatomically}. To generate the noisy mask input, we apply several noise augmentation schemes to the target shape, designed according to the error patterns observed empirically in the initial mask predictions.
After the model initialization, our learning procedure iteratively performs a two-step update: pixelwise pseudo label generation followed by network learning. For label generation, we fuse the predictions of our SSN and SDN with an uncertainty filtering mechanism, which allows us to utilize the learned shape prior to improve label quality. For network learning, we freeze the SDN and update the SSN under the supervision of both the weak labels and the generated pseudo labels.
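
As an illustrative sketch of this alternation, one possible way to write a single iteration is given below. The notation is assumed here for exposition only and is not fixed by the description above: $f_\theta$ denotes the SSN, $g_\phi$ the SDN, $x$ an input volume, $y^{w}$ its weak labels, and $\mathcal{F}$ a placeholder for the fusion with uncertainty filtering. At iteration $t$, the pseudo labels could be generated as
\[
\hat{y}^{(t)} = \mathcal{F}\big(f_{\theta^{(t)}}(x),\; g_{\phi}(f_{\theta^{(t)}}(x))\big),
\]
and the SSN parameters could then be updated, with $\phi$ kept fixed, as
\[
\theta^{(t+1)} = \arg\min_{\theta}\; \mathcal{L}_{w}\big(f_{\theta}(x),\, y^{w}\big) + \mathcal{L}_{p}\big(f_{\theta}(x),\, \hat{y}^{(t)}\big),
\]
where $\mathcal{L}_{w}$ and $\mathcal{L}_{p}$ are hypothetical loss terms on the weak labels and the pseudo labels, respectively; their concrete form and relative weighting are left unspecified in this sketch.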

%%For initialization, we obtain a self-taught shape prior by first training our SSN using weak labels, %with partial cross entropy loss, 
%and then choosing the best volumetric mask prediction according to the mask confidence. 
%With this shape prior, we can train our SDN using a denoising autoencoder training strategy~\cite{vincent2010stacked, Sundermeyer_2018_ECCV}. 
%For initialization, we first train our SSN using weak labels with partial cross entropy loss. 
%According to the confidence of initial predictions, we choose the best volumetric mask prediction as our self-taught shape prior. 
%To train our SDN, we adopt a denoising autoencoder training strategy~\cite{vincent2010stacked, Sundermeyer_2018_ECCV}. 
%Based on the observation of error patterns in the initial predictions of our SSN, we design several noise augmentation to this shape prior as input and train our SDN to output the clean shape. 
%To utilize the learned shape prior to further improve our model, we generate pseudo labels using the predictions of both our SSN and SDN with an uncertainty filtering mechanism, and update our model with supervision of both weak labels and pseudo labels, in an iterative manner. 

%Additionally, we develop a sparse labeling strategy for volumetric segmentation, including slice selection and a hybrid label design. We choose to label the starting and ending foreground slices and randomly choose some slices in between. Our hybrid label includes a foreground scribble denoting the long axis and a loose bounding box encircling all foreground pixels. 

We evaluate our method on three benchmarks with organs of distinct shape properties: trachea in the SegTHOR Challenge~\cite{trullo2019multiorgan}, left atrium in the 2018 Atrial Segmentation Challenge, and prostate in the Promise12 Challenge~\cite{litjens2014evaluation}. The empirical results show that our method consistently outperforms previous approaches. Moreover, we achieve strong results with a small amount of annotation (10\% of slices), a setting in which other existing methods would fail.
%We focus on single organ cases and our method can naturally adapt to multi organs using a divide-and-conquer strategy. 

%Our main contributions are three-folds:
%(1) We develop a shape-aware weakly supervised volumetric segmentation method that incorporates a self-taught shape denoising network into the segmentation pipeline. 
%(2) Using a hybrid annotation scheme, we design an iterative procedure for effective network learning.  
%%(2) We propose a sparse labeling strategy for weakly supervised volumetric segmentation. 
%(3) Our approach achieves state-of-the-art performance on multiple settings of annotation density under the same labeling cost.
%\begin{itemize}
%	\item We develop a novel shape denoising network for mask refinement exploring self-taught 3D shape prior.
%	\item We propose an efficient labeling strategy for weakly supervised volumetric segmentation.
%	\item Our approach achieves state-of-the-art performance on all different annotation ratios, compared to other methods with the same labeling cost.
%\end{itemize}


\begin{comment}
\section{Introduction cvpr}
Volumetric segmentation for medical images is of great importance in many computer-aided clinical applications, including auxiliary diagnosis and follow-up treatment.
Recent learning-based approaches~\cite{milletari2016v, isensee2019automated} have achieved remarkable performance on various human organ semantic segmentation tasks. However, those fully supervised methods require a large amount of mask annotations, which are expensive and time-consuming to collect and require domain expertise.

Weakly supervised methods for semantic segmentation have been widely explored on 2D natural images. Instead of full mask, those methods utilize scribble~\cite{lin2016scribblesup, vernaza2017learning}, box~\cite{dai2015boxsup, song2019box}, point~\cite{bearman2016s} or even image-level labels~\cite{pinheiro2015image, kolesnikov2016seed, wang2020self} as supervision signals for their model training. For medical image segmentation, \cite{rajchl2016deepcut} included DenseCRF in training for iterative refinement of segmentation network, while \cite{kervadec2020bounding,kervadec2019constrained} designed specific constraints as regularization loss utilizing box or point labels.

These 2D-oriented methods, however, suffer from several limitations in the task of segmenting volumetric images, e.g., CT or MRI. Firstly, they ignore the continuity of volumetric data in 3D space and are unable to exploit label correlation between 2D slices, which often leads to additional cost from redundant mask annotations. In addition, existing labeling strategies for natural image semantic segmentation are incapable of capturing strong object shape or context prior in medical images, and hence can be less effective in practice. Furthermore, these methods simply stack 2D segmented mask predictions together as the final 3D segmentation, which tends to produce noisy and incomplete masks.

%    \begin{figure}[t!]
%        \centering 
%        \includegraphics[width=0.48\textwidth]{fig/a_intro2.png}
%        \caption{(a) Examples of our proposed weak annotations on left atrium. For slices with more than one connected components, we only choose one to annotate. (b) Examples of our shape denoising results. From top to bottom: trachea, left atrium, and prostate.} 
%        \label{fig:advertise}
%%      	\vspace{-0.3cm}
%    \end{figure}

% \begin{figure}[t!]
%     \centering 
%     \begin{subfigure}[b]{0.38\linewidth}
%         \includegraphics[width=\linewidth]{fig/a_weak_label2.png}
%         \caption{} 
%         \label{fig:intro_weak}
%     \end{subfigure}
%     \begin{subfigure}[b]{0.58\linewidth}
%         \includegraphics[width=\linewidth]{fig/a_shape_effect2.png}
%         \caption{} 
%         \label{fig:intro_shape}
%     \end{subfigure}
%     \caption{}
%     \label{}
% \end{figure}

In this paper, we propose an efficient labeling strategy and a novel framework for weakly supervised volumetric segmentation to tackle the aforementioned limitations. Our main idea consists of two aspects: First, we propose a sparse labeling strategy that is able to significantly reduce annotation cost compared to existing labeling strategies, and extract more semantic information under the same cost. Second, we develop a self-taught learning strategy to capture the 3D shape prior of target classes. In particular, we observe that empirically
a segmentation network trained with our weak annotation can generate good 3D mask supervision, which allows us to learn a shape denoising module for segmentation mask refinement. Moreover, we integrate our shape denoising and segmentation network learning into a unified EM framework.  

%is to improve labeling efficiency and to utilize a self-taught 3D shape prior to fill in missing labels. 

%Subsequently, based on our empirical observation, our weak labels can supply sufficient information to learn an initial segmentation model, which have above-average generalization to some instances. Based on this initialization, our model can learn a self-taught shape prior, which can further apply to shape denoising and refinement. Moreover, our segmentation learning and shape denoising can be further improved in an iterative manner.

Specifically, we first develop a sparse and efficient labeling strategy for volumetric segmentation, including slice selection and a hybrid label design. We choose to label the starting and ending foreground slices and randomly choose some slices in between. Our hybrid label includes a foreground scribble denoting the long axis and a loose bounding box encircling all foreground pixels. Our labeling strategy reduces label density and yet still supplies sufficient foreground and background label information.

To tackle the volumetric segmentation task, we develop a deep neural network that consists of two main modules: a Semantic Segmentation Network (SSN) and a Shape Denoising Network (SDN). The SSN first estimates an initial segmentation mask from the input image, and the SDN then refines the initial mask and outputs a final mask with better 3D shape.

We learn our model in an EM framework. For initialization, we first train our SSN using weak labels with partial cross entropy loss. According to the confidence of foreground predictions, we choose the best mask instance as our self-taught shape prior. To train our SDN, we adopt a denoising autoencoder training strategy~\cite{vincent2010stacked, Sundermeyer_2018_ECCV}. Based on our observation of error patterns in the initial predictions, we design several noise augmentation to this prior shape as input and train our SDN to output the predicted mask. In E-step, we generate pseudo labels utilizing the predictions of both our SSN and SDN with a simple uncertainty filtering mechanism. In M-step, we refine the SSN with weighted cross entropy loss on both weak labels and generated pseudo labels.

We evaluate our method on three benchmarks with organs of large shape variations: trachea in SegTHOR Challenge~\cite{trullo2019multiorgan}, left atrium in 2018 Atrial Segmentation Challenge and prostate in Promise12 Challenge~\cite{litjens2014evaluation}. The empirical results show that our method consistently outperforms previous approaches. Moreover, we note that our method can still achieve strong results with a small amount of annotation (10\% slices), when other existing methods would fail in that setting. 
%We focus on single organ cases and our method can naturally adapt to multi organs using a divide-and-conquer strategy. 
Our main contributions are three-folds:

\begin{itemize}
	\item We propose an efficient labeling strategy for weakly supervised volumetric segmentation.
	\item We develop a novel shape denoising network for mask refinement exploring self-taught 3D shape prior.
	\item Our approach achieves state-of-the-art performance on all different annotation ratios, compared to other methods with the same labeling cost.
\end{itemize}
\end{comment}