\section{Related Work}\label{sec:related}
\paragraph{Weakly Supervised Semantic Segmentation}
To reduce labeling cost, weakly supervised semantic segmentation (WSSS) uses coarse annotations, e.g., image-level labels~\cite{wang2020self, fan2020learning, chang2020weakly, sun2020mining, kolesnikov2016seed, huang2018weakly, wang2018weakly}, bounding boxes~\cite{khoreva2017simple, song2019box, papandreou2015weakly, kervadec2020bounding} and scribbles~\cite{vernaza2017learning, tang2018regularized}. %For medical volumetric segmentation, we focus on bounding boxes and scribbles.

Existing methods can be roughly divided into two groups. The first group adopts an iterative learning framework~\cite{papandreou2015weakly, rajchl2016deepcut, cai2018accurate, can2018learning, ji2019scribble}, which iteratively generates pseudo labels to supervise the segmentation model.
%They first train a segmentation network with weak labels as initialization, then process predicted masks into pseudo labels in E-step and optimize parameters in M-step. 
These approaches progressively obtain training signals from pseudo labels, but they still suffer from error propagation.
The second group avoids error propagation by applying a regularization-based framework~\cite{pathak2015constrained, kervadec2020bounding,tang2018regularized}. BoxPrior~\cite{kervadec2020bounding} proposes several global constraints derived from box annotations to optimize the segmentation model. KernelCut~\cite{tang2018regularized} proposes to use MRF/CRF regularization loss terms for better mask predictions. 
The method of~\cite{peng2020discretely} adopts a 3D segmentation framework with discrete constraints and regularization priors, but it requires an accurate global anatomical atlas, which is hard to obtain in real scenarios.
Compared to iterative learning methods, these approaches are lightweight but do not exploit learned information to further improve their models.

We adopt an iterative learning framework because it incrementally incorporates training signals, and we design a shape denoising model to mitigate error propagation.
In addition, existing methods ignore intrinsic shape priors of objects, while we exploit a self-taught shape prior from weak labels to improve segmentation performance.

\paragraph{Self-taught Learning}
Lack of training data is a common challenge in learning problems, and self-taught learning is a promising solution.
%Self-taught learning assumes that label spaces between the source domain and target domain are different, whose goal is to employ unlabeled source domain data to solve the target domain task. 
Prior works apply self-taught learning to classification~\cite{raina2007self, wang2013robust, feng2020autoencoder}, clustering~\cite{li2017self, dai2008self} and detection~\cite{bazzani2016self, jie2017deep}. We introduce self-taught learning to weakly supervised segmentation. We first extract a self-taught shape representation from weak labels with a segmentation network, and then use a shape denoising network to encode this representation for further shape denoising and refinement.

% In classification tasks, \cite{raina2007self, wang2013robust} propose to learn high-level image patterns via the sparse coding algorithm from a large amount of unlabeled images, and \cite{feng2020autoencoder} proposes a metric for relevance between a source sample and the target samples. For clustering tasks, \cite{li2017self} presents a self-taught low-rank coding framework, and \cite{dai2008self} proposes a co-clustering algorithm. \cite{kuen2015self} incorporates self-taught learning into the visual tracking task, by learning local invariant representations from unlabeled data. In detection tasks, \cite{bazzani2016self, jie2017deep} apply self-taught learning to learn self-taught detectors without or with human supervision.
% ours
% We introduce self-taught learning to weakly supervised segmentation. We first extract a self-taught shape prior from weak labels with a segmentation network and then utilize a shape denoising network to encode this prior for further shape denoising and refinement. Instead of selecting numerous samples from the source domain, we only select a single sample of high confidence and enrich training signals with specifically designed augmentation.


\paragraph{Denoising Autoencoder}
%We design a Shape Denoising Network (SDN) to encode a unified shape prior and then to apply to the initial coarse mask for shape refinement, inspired by Denoising Autoencoder (DAE)~\cite{vincent2010stacked} and Augmented Autoencoder (AAE)~\cite{Sundermeyer_2018_ECCV}. 
Denoising Autoencoder (DAE)~\cite{vincent2010stacked} encodes an image into a noise-invariant latent embedding that represents the original clean image. Augmented Autoencoder (AAE)~\cite{Sundermeyer_2018_ECCV} produces an encoding of the orientation of the object in the input image, which is invariant to other transformations and environmental conditions. Unlike these methods, which aim for a representative embedding, our goal is to recover a clean and complete shape from an input mask. Our method implicitly captures the underlying manifold of true shape data, rather than that of images or object orientations. ACNNs~\cite{oktay2017anatomically} also investigate modeling shape priors with an autoencoder for fully supervised segmentation and image super-resolution, using them as regularization constraints in the encoding space. In contrast, we develop a self-taught learning method and design a shape denoising autoencoder that explicitly performs denoising and recovers the clean shape.
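For reference, the generic denoising-autoencoder objective can be sketched as follows. The notation is assumed here purely for illustration and is not taken from the cited works: $f_\theta$ denotes the encoder, $g_\phi$ the decoder, $q(\tilde{x}\mid x)$ a corruption process, and $\ell$ a reconstruction loss.
\[
\min_{\theta,\phi}\; \mathbb{E}_{x}\, \mathbb{E}_{\tilde{x}\sim q(\tilde{x}\mid x)} \Bigl[\, \ell\bigl(x,\; g_\phi(f_\theta(\tilde{x}))\bigr) \Bigr].
\]
Under this reading, a shape denoising setting would take $x$ to be a clean shape mask and $\tilde{x}$ a corrupted mask, so that the reconstruction $g_\phi(f_\theta(\tilde{x}))$, rather than the embedding $f_\theta(\tilde{x})$, is the output of interest.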
%\cite{vincent2008extracting} first proposes to use a denoising autoencoder, to learn a robust representation by learning to reconstruct artificially corrupted training data. \cite{geras2014scheduled} further presents a representation learning method that learns features at multiple different levels of scale. \cite{xiong2016denoising} maps raw images to hierarchical representations in an unsupervised manner, and \cite{gondara2016medical} apply denoising autoencoder for efficient denoising of medical images. In this paper we leverage denoising autoencoder to encode a self-taught shape prior and to denoise volumetric shapes.












\begin{comment}

The lack of training data is a common challenge in learning problems, and self-taught learning is a promising solution to tackle it. Different from semi-supervised learning methods, self-taught learning assumes that the label spaces of the source domain and the target domain are different. Self-taught learning aims to employ unlabeled source domain data to solve the target domain task. 

Most efforts in self-taught learning have been focused on using the entire source samples to achieve knowledge transferring, by learning generalized features or representations. 

In supervised classification tasks, [8] first proposes a self-taught learning approach that uses a large amount of unlabeled images. After constructing high-level features via the sparse coding algorithm, they learn a classifier by using the SVM algorithm. Following [8], [16] further presents a robust and discriminative approach by imposing sparse regularizations in learning high-level image patterns. 

For clustering tasks, a self-taught low-rank coding framework [17] is proposed that employs a low-rank constraint to characterize the global structural information in the target domain. [10] proposes a co-clustering algorithm to tackle unsupervised transfer learning problems, by learning a general feature representation with numerous unlabeled auxiliary data.

In the task of visual tracking, [11] leverages local invariant representations learned from unlabeled data and transfers them to the observational model of the proposed tracker. [14] integrates self-taught learning into hyperspectral image classification, by learning models to extract generalized features from large quantities of unlabeled data. 

In detection tasks, [12] proposes a detection scheme with self-taught localization hypotheses, which embeds the idea that a drop in recognition score reflects a change in object coverage. [13] incorporates a self-taught detector into weakly-supervised detection, by proposing a seed sample acquisition method via image-to-object transferring and dense subgraph discovery to find reliable positive samples for the detector.

To avoid negative knowledge transfer and achieve an effective sample selection in self-taught classification tasks, [2020] proposes a metric for relevance between a source sample and the target samples.

% ours
Our self-taught shape prior differs from the above methods in three ways. First, we introduce self-taught learning to weakly-supervised segmentation, by proposing a shape denoising network that aims to encode and denoise 3D object shapes. Second, for training samples, existing methods usually select numerous samples from the source domain, while we select a single sample of high confidence from the segmentation prediction domain. Lastly, we apply designated shape-aware augmentations to broaden the training set.

\end{comment}

\begin{comment}
    
% image-level (need to add the CVPR 2020 and ECCV 2020 papers)
[19, 25, 27].
1. Self-Supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation
2. Learning Integral Objects With Intra-Class Discriminator for Weakly-Supervised Semantic Segmentation
3. Weakly-Supervised Semantic Segmentation via Sub-category Exploration
4. Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation

11. Seed, expand and constrain: Three principles for weakly-supervised
image segmentation. In Proc. European Conference on Computer Vision (ECCV), 2016.

Constrained convolutional neural networks for weakly supervised segmentation. In Proc. IEEE International Conference
on Computer Vision (ICCV), 2015.

From image-level
to pixel-level labeling with convolutional networks. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[38, 20, 15, 37, 40, 14].
Object region mining with adversarial erasing: A simple classification to semantic
segmentation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
1568–1576, 2017.

12. Weakly-supervised semantic segmentation
network with deep seeded region growing. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7014–7023, 2018.

13. Weakly supervised semantic segmentation by iteratively mining
common object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
1354–1362, 2018

% Revisiting dilated convolution:
% A simple approach for weakly- and semi-supervised semantic segmentation. In Proceedings of the IEEE Conference
% on Computer Vision and Pattern Recognition, pages 7268–
% 7277, 2018.

% Self-erasing network for integral object attention. In
% Advances in Neural Information Processing Systems, pages
% 549–559, 2018.

% box
[7,18],
√ Boxsup: Exploiting bounding boxes to supervise convolutional networks for
semantic segmentation. In Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
Simple does it: Weakly supervised
instance and semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017.

[7, 34],
Box-driven class-wise region masking and filling rate
guided loss for weakly supervised semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3136–3145, 2019.

Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: ICCV (2015)

% scribble
[22, 30]
√ Scribblesup: Scribble-supervised convolutional networks for
semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
√ Learning random-walk label propagation for weakly-supervised semantic segmentation. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017.



% Accurate Weakly-Supervised Deep Lesion Segmentation Using Large-Scale Clinical Annotations: Slice-Propagated 3D Mask Generation from 2D RECIST
[1] proposes a weakly-supervised lesion segmentation framework that gradually uses CNN outputs to train on incrementally more slices.
% Learning to Segment Medical Images with Scribble-Supervision Alone
[2] explores an iterative two-step procedure with scribble labels, in which a segmentation network is trained on the previous labels and then combined with a conditional random field (CRF) to relabel the training set.
% 


\end{comment}