%\vspace{-0.3cm}
\section{Method}


We now introduce our method for weakly supervised volumetric segmentation. %The main idea is to learn a self-taught shape prior from weak labels and then to utilize this prior for further shape denoising and refinement. %To achieve this, we develop a deep neural network consisting of two main modules: a Semantic Segmentation Network (SSN) and a Shape Denoising Network (SDN). Our SSN first predicts an initial segmentation mask from the input image volume and then our SDN applies the self-taught shape prior on this initial mask for denoising and refinement. 
%In the remaining of this section, we introduce our method design in detail. 
We start from the problem setting and model overview in Sec.~\ref{sec:setting}, 
%followed by the network architecture in Sec.~\ref{sec:net}. 
followed by the model design in Sec.~\ref{sec:net}. 
To learn our network, we propose a weak annotation strategy in Sec.~\ref{sec:weak_label} and adopt an iterative learning framework in Sec.~\ref{sec:learning}. 
%An overview of our method is illustrated in Fig.~\ref{fig:model}.


\subsection{Problem Setting and Model Overview}\label{sec:setting}
Given an input volumetric image $\mathbf{I} \in \mathbb{R}^{H \times W \times D}$, our goal is to estimate its segmentation mask $\mathbf{M} \in \mathcal{S}^{H \times W \times D}$, where $H$ and $W$ are the height and width of an image slice, and $D$ is the number of slices. $\mathcal{S} = \{0, 1\}$ is the semantic label set with $0$ as background and $1$ as foreground.
We assume a training set $\mathcal{D} = \{(\mathbf{I}^n, \mathbf{Y}^n)\}_{n=1}^N$, where $\mathbf{Y}^n \in \mathcal{S}'^{H \times W \times D}$ is the weak label of $\mathbf{I}^n$, $\mathcal{S}' = \mathcal{S} \cup \{u\}$ with $u$ marking unlabeled pixels, and $N$ is the number of training samples. 
%Our model can learn from labeled pixels and generalize to unlabeled regions.
We focus on single foreground class segmentation in this paper, although our method can be applied to multi-class problems by handling each class separately.

The main idea is to learn a self-taught shape prior from weak labels and then utilize this prior for shape denoising and refinement. To achieve this, we develop a deep neural network consisting of two main modules: a Semantic Segmentation Network (SSN) and a Shape Denoising Network (SDN). Our SSN first predicts an initial coarse segmentation mask from the input image volume, and our SDN then applies the self-taught shape prior to this initial mask for denoising and refinement. To incorporate the learned shape prior and further improve our model, we adopt an iterative learning framework that alternates between generating pseudo labels and updating the model. An overview of our method is shown in Fig.~\ref{fig:model}.


\begin{figure*}[t!]
	\centering 
	\includegraphics[width=1.00\textwidth]{fig/b_model_shape.png}
	\caption{(a) Overview of our method. (b) An example of our self-taught shape representation and corresponding augmentation effect on trachea.}
	%\caption{Our model consists of two main modules: Semantic Segmentation Network (SSN) and Shape Denoising Network (SDN). Our SSN predicts an initial segmented mask from the input volumetric image, which is further refined by our SDN as final output. To train our model, we first initialize the whole system by training our SSN on weak labels. Then we extract a self-taught shape prior with our SSN and use it as the training signal with our specifically designed noise augmentation, to train our SDN. Moreover, we further improve our model with an EM strategy. In E-step, we generate pseudo masks with both outputs from SSN and SDN, utilizing a simple uncertainty filtering mechanism. In M-step, we optimize our SSN with loss on both weak labels and pseudo labels.}
	\label{fig:model}
%	\vspace{-0.3cm}
\end{figure*}

\subsection{Model Design}\label{sec:net}
We now describe the two main network modules of our model in detail.
%Our model consists of two main modules: a Semantic Segmentation Network (SSN) and a Shape Denoising Network (SDN). Our SSN predicts an initial segmented mask from the input volumetric image, which is further refined by our SDN as final output.

\paragraph{Semantic Segmentation Network}\label{sec:SSN}
Our Semantic Segmentation Network (SSN) $\mathcal{F}_{SSN}$ provides an initial coarse mask by taking a volumetric image $\mathbf{I}$ as input and outputting a probability map $\mathbf{P}_s \in [0,1]^{H\times W\times D}$, indicating the confidence of each pixel belonging to foreground. From $\mathbf{P}_s$ we can derive the initial foreground segmentation mask $\mathbf{M}_s$: 
%\begin{align}
%\mathbf{P}_s = \mathcal{F}_{SSN} (\mathbf{I}; \Theta), \quad \mathbf{M}_s = \mathds{1} (\mathbf{P}_s > 0.5) 
%\end{align}
$\mathbf{P}_s = \mathcal{F}_{SSN} (\mathbf{I}; \Theta), \quad \mathbf{M}_s = \mathds{1} (\mathbf{P}_s > 0.5)$, 
where $\Theta$ denotes the parameters of $\mathcal{F}_{SSN}$ and $\mathds{1}(\cdot)$ is the indicator function.
We instantiate our SSN with nnU-Net~\cite{isensee2019automated}, a state-of-the-art framework for medical image semantic segmentation. Detailed network configurations are given in Appendix~\ref{appendix:net_config}.

\paragraph{Shape Denoising Network}\label{sec:SDN}
Inspired by the Denoising Autoencoder (DAE)~\cite{vincent2010stacked} and the Augmented Autoencoder (AAE)~\cite{Sundermeyer_2018_ECCV}, we design a Shape Denoising Network (SDN) $\mathcal{F}_{SDN}$ that encodes a unified shape prior and then applies it to the initial coarse mask for shape refinement. 
%DAE encodes an image into a latent embedding which is invariant to noise, to represent the original clean image.
%AAE produces the orientation encoding of the object in the input image, which is invariant to other transformation and environmental conditions. 
%Different from these methods aiming for a representative embedding for an image or object orientation, our goal is to recover a clean and complete shape from an input mask.
Given the initial mask $\mathbf{M}_s$ output by the SSN, our SDN implicitly applies self-taught shape prior constraints and outputs a clean, shape-refined mask $\mathbf{M}_d$: 
%\begin{align}
%\mathbf{P}_d = \mathcal{F}_{SDN} (\mathbf{M}_s; \Omega), \quad \mathbf{M}_d = \mathds{1} (\mathbf{P}_d > 0.5) 
%\end{align}
$\mathbf{P}_d = \mathcal{F}_{SDN} (\mathbf{M}_s; \Omega), \quad \mathbf{M}_d = \mathds{1} (\mathbf{P}_d > 0.5)$, 
where $\Omega$ denotes the parameters of $\mathcal{F}_{SDN}$. Since we aim for the final mask rather than a latent embedding, our $\mathcal{F}_{SDN}$ shares the same U-Net architecture as $\mathcal{F}_{SSN}$, which keeps a larger spatial resolution at its bottleneck than traditional autoencoder networks and includes skip connections to capture more mask details.


\begin{figure}[t!]
	\centering 
	\includegraphics[width=1.0\textwidth]{fig/c_weak_annotation.png}
%	\vspace{-0.5cm}
	\caption{Different annotations for trachea (left) and left atrium (right). \emph{Scribble (dilation)} denotes a scribble generated by foreground mask dilation.} 
	\label{fig:all_labels}
%	\vspace{-0.3cm}
\end{figure}




\subsection{Weak Annotation Strategy}\label{sec:weak_label}
To better exploit the spatial continuity of the object mask and facilitate learning the shape context, we now introduce a sparse weak annotation strategy for volumetric segmentation.
%To testify the capability of our model in of utilizing spacial continuity for label propagation and to encode more location context to further boost model performance at the same labeling cost, 
Our annotation scheme consists of two components: \textbf{slice selection} and a \textbf{hybrid labeling strategy}. For \textbf{slice selection}, we label the starting and ending slices of each foreground object, which carry important boundary information along the z-axis. In addition to those two slices, we randomly label a subset of the slices in between. %, to capture more rich information in all spatial positions of different object instances. 
In this work, we investigate multiple labeling ratios, with 10\%, 30\%, 50\%, and 100\% of the foreground slices labeled. 
Moreover, we design a \textbf{hybrid labeling strategy} for 2D slices, which combines a long-axis scribble on the foreground object with a loose bounding box that encloses all foreground pixels in the slice. 
Specifically, for the long-axis scribble, the annotator only has to click two points near the boundary from inside the foreground object, which are automatically connected into a line. For the loose bounding box, the annotator only needs to click two points, at the upper-left and lower-right corners, which are automatically connected into a box. 
Note that our hybrid label does not require precise localization of boundary points; the annotator only needs to roughly indicate the inside and outside regions of the foreground. 
Compared to a traditional scribble or a tight bounding box, our hybrid label provides rich background information and rough localization, as well as foreground pixel labels, with only four clicks per 2D slice.
To simulate our weak annotation strategy, we derive our annotations from full masks. Examples are shown in Fig.~\ref{fig:all_labels}. 
More details are presented in Appendix~\ref{appendix:weak_label}.
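As an illustration, one way to formalize the label induced by a hybrid annotation on a labeled slice is sketched here. The sets $\mathcal{C}$ (pixels on the scribble line) and $\mathcal{B}$ (pixels inside the loose box) are notation introduced only for this sketch, and the rule assumes that every pixel outside the box is treated as background; the appendix remains the reference for the actual procedure.
\begin{equation*}
\mathbf{Y}_i =
\begin{cases}
1, & i \in \mathcal{C},\\
0, & i \notin \mathcal{B},\\
u, & \text{otherwise}.
\end{cases}
\end{equation*}
Under this reading, a slice annotated with four clicks yields a thin line of foreground labels, background labels everywhere outside the box, and unlabeled pixels in between.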


\subsection{Model Learning}\label{sec:learning}


%\paragraph{Iterative Model Learning}\label{sec:learning}
To effectively train our model, we adopt an iterative learning framework, 
by generating pseudo labels and updating the model iteratively. 
%by first training SSN and SDN as initialization, and then refining the whole model with generated pseudo labels iteratively.
To generate initial pseudo labels, we first initialize our SSN and SDN. 
Then we compute pseudo labels by applying an uncertainty filtering mechanism to the combined outputs of our SSN and SDN, which removes noise and incorporates the learned shape prior into model updating. 
Below we sequentially introduce the training of our SSN and our self-taught SDN, our uncertainty filtering for pseudo label generation, and finally our model updating.


\paragraph{Training Semantic Segmentation Network}\label{sec:bootstrap}
We first train our SSN on weak labels to provide initial segmentation masks, which serve as important training signals for our SDN. 
Given input images and corresponding weak labels, we %initialize our system by training 
train our SSN with weighted cross entropy on labeled pixels: $\mathcal{L}_{SSN} (\Theta) = \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y})$. Due to highly imbalanced foreground and background in weak labels, we adopt an auto-weighting strategy in our loss function. %, to systematically balance labeled foreground and background into 1:1 for each volume. %Empirical experiments show that this auto-weighting strategy is more stable and has better generalization to different datasets, compared to a fixed weighting ratio.
%After the initialization, our SSN is able to provide initial segmentation masks, which serve as important training signals for our SDN.
Specifically, for each volume with $N_b$ labeled background pixels and $N_f$ labeled foreground pixels, we compute the loss as $(\frac{1}{N_b} \sum_{i=1}^{N_b} l_i + \frac{1}{N_f} \sum_{j=1}^{N_f} l_j) / 2$, where $l_i$ and $l_j$ denote the cross-entropy losses of labeled background pixel $i$ and labeled foreground pixel $j$, respectively.
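Written out under the assumption that each per-pixel term is the standard binary cross entropy (the subscript notation $\mathbf{P}_{s,i}$ for the predicted foreground probability of pixel $i$ is introduced here for illustration), the auto-weighted loss for one volume reads
\begin{align*}
\mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}) &= \frac{1}{2} \Big( \frac{1}{N_b} \sum_{i=1}^{N_b} l_i + \frac{1}{N_f} \sum_{j=1}^{N_f} l_j \Big), \\
l_i &= -\log (1 - \mathbf{P}_{s,i}) \ \text{(background)}, \qquad l_j = -\log \mathbf{P}_{s,j} \ \text{(foreground)}.
\end{align*}
For instance, in a hypothetical volume with $N_b = 900$ and $N_f = 100$ labeled pixels, each background pixel carries weight $1/1800$ and each foreground pixel carries weight $1/200$. A foreground pixel thus counts nine times as much as a background pixel, and the two classes each contribute half of the total weight.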
%\begin{align}
%\mathcal{L}_{SSN} (\Theta) = \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y})
%\end{align}



\paragraph{Self-training of Shape Denoising Network}
We train our SDN with a self-taught learning strategy, which differs from previous methods for learning denoising models. 
DAE~\cite{vincent2010stacked} applies artificial random noise to input images and reconstructs the corresponding clean targets. AAE~\cite{Sundermeyer_2018_ECCV} proposes a domain randomization technique that mimics the environment and sensor variations captured by real cameras, and uses it to augment synthetic input images and reconstruct images that are invariant to factors other than orientation.

In our case, one main difference is that we have no full mask annotations from which to learn the shape prior. One possible solution is to use synthetic digital shape models, but these might have a large domain gap to the target datasets, especially for masks in medical segmentation. 
To avoid such a domain gap, we propose a self-taught learning strategy: we first extract a self-taught shape representation with our weakly supervised segmentation model SSN and then train our SDN with this shape representation. 
%To avoid such domain gap, we propose to extract a self-taught shape prior with our weakly supervised segmentation model SSN. 
The underlying assumption is that our trained SSN generates masks with above-average accuracy for some instances in the training split, and can thus supply masks with better shape quality to help refine the others. To this end, we first compute the average foreground probability of each $\mathbf{P}_s$ in the training split as the confidence of the corresponding predicted mask $\mathbf{M}_s$, and then take the mask with the highest confidence as our self-taught shape representation $\mathbf{M}^*$ to train our SDN.
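The selection rule can be sketched as follows. The confidence score $c^n$ and the index $n^*$ are notation introduced here, and the sketch assumes, as one plausible reading, that the average is taken over the predicted foreground pixels of each training volume:
\begin{equation*}
c^n = \frac{\sum_i \mathbf{M}^n_{s,i} \, \mathbf{P}^n_{s,i}}{\sum_i \mathbf{M}^n_{s,i}}, \qquad
n^* = \arg\max_{n \in \{1, \dots, N\}} c^n, \qquad
\mathbf{M}^* = \mathbf{M}^{n^*}_s .
\end{equation*}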

To train our SDN for noise removal, instead of adding general random noise to input masks, we specifically design our noise augmentation, based on the observation of errors produced by our initially trained SSN. 
We summarize the typical error modes of initial segmentation masks in three categories: (1) Over-smoothed regions;
(2) Wrongly attached blobs;
(3) Over-prediction of foreground regions beyond starting and ending slices. 
These errors occur mainly because there is no clear boundary supervision in weak labels, while the neighboring objects might share similar intensity or texture to the target object.

To equip our SDN with the capability to deal with the aforementioned errors, we design three corresponding noise augmentation operations: 
(1) Closing; 
(2) Dilation; 
(3) Extension of marginal slices. 
Examples are shown in Fig.~\ref{fig:model}(b). %~\ref{fig:shape_aug}.
Moreover, we apply spatial transformations, including rotation, translation, and scaling, to capture rich position and size variations, which helps learn the underlying shape manifold. Detailed information is in Sec.~\ref{sec:implementation}. 
We augment our self-taught shape representation $\mathbf{M}^*$ into $\mathbf{\hat{M}}^*$, and train our SDN to reconstruct the clean mask $\mathbf{M}^*$ with cross entropy loss: $\mathcal{L}_{SDN} (\Omega) = \mathcal{L}_{ce} (\mathcal{F}_{SDN} (\mathbf{\hat{M}}^*; \Omega), \mathbf{M}^*)$.
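Expanded per pixel, and assuming the standard binary cross entropy averaged over all $HWD$ pixels (the normalization and the shorthand $\mathbf{\hat{P}}_d$ are assumptions made here for illustration), this objective reads
\begin{align*}
\mathbf{\hat{P}}_d &= \mathcal{F}_{SDN} (\mathbf{\hat{M}}^*; \Omega), \\
\mathcal{L}_{SDN} (\Omega) &= -\frac{1}{HWD} \sum_{i} \Big[ \mathbf{M}^*_i \log \mathbf{\hat{P}}_{d,i} + (1 - \mathbf{M}^*_i) \log (1 - \mathbf{\hat{P}}_{d,i}) \Big].
\end{align*}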



%\begin{align}
%%\mathbf{\hat{P}}_d &= \mathcal{F}_{SDN} (\mathbf{\hat{M}}_s; \Omega)\\
%\mathcal{L}_{SDN} (\Omega) &= \mathcal{L}_{ce} (\mathcal{F}_{SDN} (\mathbf{\hat{M}}^*; \Omega), \mathbf{M}^*)
%\end{align}


%\begin{figure}[t!]
%    \centering 
%    \includegraphics[width=0.45\textwidth]{fig/b_shape_aug2.png}
%    \caption{Examples of our shape augmentation operations.} 
%    \label{fig:shape_aug}
%%    \vspace{-0.3cm}
%\end{figure}




\paragraph{Uncertainty Filtering}
To incorporate the self-taught shape prior into our model and to remove noise, 
we generate pseudo masks from the predictions of both SSN and SDN with a simple uncertainty filtering mechanism. Specifically, we first compute the intersection of the segmented mask from SSN and the shape-refined mask from SDN, and then apply uncertainty filtering based on the per-pixel confidence in the SSN output $\mathbf{P}_s$. We compute pseudo label masks for foreground ($\mathbf{Y}_{fg}$) and background ($\mathbf{Y}_{bg}$) independently as below, where $\sigma_{fg}$ and $\sigma_{bg}$ are the respective uncertainty thresholds, further explained in Appendix~\ref{appendix:training}. The final pseudo label $\mathbf{Y}_p$ combines $\mathbf{Y}_{fg}$ and $\mathbf{Y}_{bg}$, with unlabeled pixels set to $u$.
\begin{align}\label{eq:E-step}
\mathbf{Y}_{fg} &= \mathbf{M}_s * \mathbf{M}_d * \mathds{1} (\mathbf{P}_s > \sigma_{fg}), \quad
\mathbf{Y}_{bg} = (1-\mathbf{M}_s) * (1-\mathbf{M}_d) * \mathds{1} (\mathbf{P}_s < \sigma_{bg})\\
%\end{align}
%The final pseudo label $\mathbf{Y}_p$ combines $\mathbf{Y}_{fg}$ and $\mathbf{Y}_{bg}$, with unlabeled pixels set to $u$.
%\begin{align}
\mathbf{Y}_p &= \mathds{1}(\mathbf{Y}_{fg} = 1) + u * \mathds{1}(\mathbf{Y}_{fg} = 0) * \mathds{1}(\mathbf{Y}_{bg} = 0)
\end{align}
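To make the rule concrete, the table below traces four hypothetical pixels through Eq.~(\ref{eq:E-step}) and the definition of $\mathbf{Y}_p$. The thresholds $\sigma_{fg} = 0.9$ and $\sigma_{bg} = 0.1$ are illustrative values chosen for this example only, not reported settings.
\begin{center}
\begin{tabular}{cccccc}
\hline
$\mathbf{P}_s$ & $\mathbf{M}_s$ & $\mathbf{M}_d$ & $\mathbf{Y}_{fg}$ & $\mathbf{Y}_{bg}$ & $\mathbf{Y}_p$ \\
\hline
0.95 & 1 & 1 & 1 & 0 & 1 \\
0.70 & 1 & 1 & 0 & 0 & $u$ \\
0.95 & 1 & 0 & 0 & 0 & $u$ \\
0.03 & 0 & 0 & 0 & 1 & 0 \\
\hline
\end{tabular}
\end{center}
Only pixels on which both networks agree and on which the SSN is confident receive a pseudo label. The second pixel is dropped because its confidence falls below $\sigma_{fg}$, and the third because SSN and SDN disagree.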

\paragraph{Model Updating}
Given the generated pseudo label $\mathbf{Y}_p$, 
we update the parameters $\Theta$ of our SSN by minimizing the weighted cross entropy loss of segmentation probability $\mathbf{P}_s$ w.r.t. both the original weak labels and the generated pseudo labels: %, using the same auto-weighting strategy as in Sec~\ref{sec:bootstrap}: 
$\mathcal{L} (\Theta) = \lambda_w \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}) + \lambda_p \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}_p)$, 
%\begin{align}
%\mathcal{L}_{M} (\Theta) &= \lambda_w \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}) + \lambda_p \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}_p)
%%\mathcal{L}_{M} (\Theta) &= \frac{1}{N}\sum_{n=1}^N (\mathcal{L}_{wce}^n (\mathbf{P}_s, \mathbf{Y}^n) + \mathcal{L}_{wce}^n (\mathbf{P}_s, \mathbf{Y}_p))
%%\Theta &= \arg \min \mathcal{L}_{M} (\Theta)
%\end{align}
where 
$\lambda_w$ and $\lambda_p$ are the corresponding loss weights. 
Note that we fix the parameters $\Omega$ of our SDN during iterative learning. Empirically, updating $\Omega$ does not provide further improvement, mainly because our self-taught shape representation is of relatively good quality and our noise augmentation is strong enough to capture various error modes. %In practice, we compute gradients and update $\Theta$ in each minibatch with SGD.
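Putting the pieces together, one round of the iterative learning can be summarized as below. The round index $t$ and the operator $\mathcal{U}$ (shorthand for the uncertainty filtering described above) are notation introduced here for illustration, and the $\arg\min$ is idealized, since in practice the minimization is only approximate.
\begin{align*}
\mathbf{P}_s^{(t)} &= \mathcal{F}_{SSN} (\mathbf{I}; \Theta^{(t)}), \qquad \mathbf{M}_s^{(t)} = \mathds{1} (\mathbf{P}_s^{(t)} > 0.5), \\
\mathbf{M}_d^{(t)} &= \mathds{1} \big( \mathcal{F}_{SDN} (\mathbf{M}_s^{(t)}; \Omega) > 0.5 \big), \\
\mathbf{Y}_p^{(t)} &= \mathcal{U} \big( \mathbf{P}_s^{(t)}, \mathbf{M}_s^{(t)}, \mathbf{M}_d^{(t)} \big), \\
\Theta^{(t+1)} &= \arg\min_{\Theta} \; \lambda_w \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}) + \lambda_p \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}_p^{(t)}), \qquad \mathbf{P}_s = \mathcal{F}_{SSN} (\mathbf{I}; \Theta).
\end{align*}
Here $\Theta^{(0)}$ denotes the SSN parameters after the initial training on weak labels, and $\Omega$ stays fixed throughout.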

%
%\begin{comment}
%\clearpage
%\section{Method cvpr}
%    \begin{figure*}[t!]
%        \centering 
%        \includegraphics[width=1.00\textwidth]{fig/b_model_shape.png}
%        \caption{(a) Overview of our method. (b) An example of our shape prior and corresponding augmentation effect on trachea.}
%        %\caption{Our model consists of two main modules: Semantic Segmentation Network (SSN) and Shape Denoising Network (SDN). Our SSN predicts an initial segmented mask from the input volumetric image, which is further refined by our SDN as final output. To train our model, we first initialize the whole system by training our SSN on weak labels. Then we extract a self-taught shape prior with our SSN and use it as the training signal with our specifically designed noise augmentation, to train our SDN. Moreover, we further improve our model with an EM strategy. In E-step, we generate pseudo masks with both outputs from SSN and SDN, utilizing a simple uncertainty filtering mechanism. In M-step, we optimize our SSN with loss on both weak labels and pseudo labels.}
%        \label{fig:model}
%%        \vspace{-0.3cm}
%    \end{figure*}
%
%We now introduce our method for weakly supervised volumetric segmentation. The main idea of our method is to learn a self-taught shape prior from weak labels and then to utilize this prior for further shape denoising and refinement. To achieve this, we develop a deep neural network consisting of two main modules: a Semantic Segmentation Network (SSN) and a Shape Denoising Network (SDN). Our SSN first predicts an initial segmentation mask from the input image volume and then our SDN applies the self-taught shape prior on this initial mask for denoising and refinement.
%
%In the remaining of this section, we will introduce our method design in detail. We start from the problem setting of weakly supervised volumetric segmentation in Sec.~\ref{sec:setting}, 
%followed by the network design of SSN and SDN in Sec.~\ref{sec:net}. 
%%followed by the network design of SSN in Sec.~\ref{sec:SSN} and SDN in Sec.~\ref{sec:SDN}. 
%To learn our network, we adopt an Expectation-Maximization (EM) strategy in Sec.~\ref{sec:learning}. 
%An overview of our framework is illustrated in Fig.~\ref{fig:model}.
%
%
%\subsection{Problem Setting}\label{sec:setting}
%Given an input volumetric image $\mathbf{I} \in \mathbb{R}^{H \times W \times D}$, our goal is to estimate its segmentation mask $\mathbf{M} \in \mathcal{S}^{H \times W \times D}$, where $H$ and $W$ are the height and width of an image slice, and $D$ is the number of slices. $\mathcal{S} = \{0, 1\}$ is the semantic label set with $0$ as background and $1$ as foreground.
%Assume we are given a training set $\mathcal{D} = \{\mathbf{I}^n, \mathbf{Y}^n\}_{n=1}^N$, where $\mathbf{Y}^n \in \mathcal{S}'^{H \times W \times D}$ is the corresponding weak label to $\mathbf{I}^n$, $\mathcal{S}' = \mathcal{S} \cup \{u\}$ with $u$ representing unlabeled pixels, and $N$ is the number of training data samples. 
%%Our model can learn from labeled pixels and generalize to unlabeled regions.
%We focus on single foreground class segmentation in this paper, while our method can be applied to multi-class problems by dealing with each class separately.
%
%
%
%
%\subsection{Network Architecture}\label{sec:net}
%Our model consists of two main modules: Semantic Segmentation Network (SSN) and Shape Denoising Network (SDN). Our SSN predicts an initial segmented mask from the input volumetric image, which is further refined by our SDN as final output.
%
%\paragraph{Semantic Segmentation Network}\label{sec:SSN}
%Our Semantic Segmentation Network (SSN) $\mathcal{F}_{SSN}$ takes a volumetric image $\mathbf{I}$ as input and outputs a probability map $\mathbf{P}_s \in [0,1]^{H\times W\times D}$, indicating the confidence of each pixel belonging to foreground. From $\mathbf{P}_s$ we can derive the initial foreground segmentation mask $\mathbf{M}_s$: 
%%\begin{align}
%%\mathbf{P}_s = \mathcal{F}_{SSN} (\mathbf{I}; \Theta), \quad \mathbf{M}_s = \mathds{1} (\mathbf{P}_s > 0.5) 
%%\end{align}
%$\mathbf{P}_s = \mathcal{F}_{SSN} (\mathbf{I}; \Theta), \mathbf{M}_s = \mathds{1} (\mathbf{P}_s > 0.5)$, 
%where $\Theta$ denotes the parameters of $\mathcal{F}_{SSN}$ and $\mathds{1}(\cdot)$ is the indicator function.
%We instantiate our SSN with nnU-Net~\cite{isensee2019automated}, which is the state-of-the-art model architecture for medical image semantic segmentation. Detailed network configurations are in Appendix Sec.\ref{sec:net_config}.
%
%
%\paragraph{Shape Denoising Network}\label{sec:SDN}
%Inspired by Denoising Autoencoder (DAE)~\cite{vincent2010stacked} and Augmented Autoencoder (AAE)~\cite{Sundermeyer_2018_ECCV}, we design a Shape Denoising Network (SDN) $\mathcal{F}_{SDN}$, aiming to encode a unified shape prior and then to apply to noisy input mask for shape refinement.
%DAE encodes an image into a latent embedding which is invariant to noise, to represent the original clean image.
%AAE produces the orientation encoding of the object in the input image, which is invariant to other transformation and environmental conditions.
%
%Different from these methods aiming for a representative embedding, our goal is to recover a clean and complete shape from an input mask.
%Given the initial mask $\mathbf{M}_s$ from SSN output, our SDN implicitly applies self-taught shape prior constraints and outputs a clean and shape-refined mask $\mathbf{M}_d$: 
%%\begin{align}
%%\mathbf{P}_d = \mathcal{F}_{SDN} (\mathbf{M}_s; \Omega), \quad \mathbf{M}_d = \mathds{1} (\mathbf{P}_d > 0.5) 
%%\end{align}
%$\mathbf{P}_d = \mathcal{F}_{SDN} (\mathbf{M}_s; \Omega), \mathbf{M}_d = \mathds{1} (\mathbf{P}_d > 0.5)$, 
%where $\Omega$ denotes the parameters of $\mathcal{F}_{SDN}$. Since we aim for the final mask rather than the latent embedding, our $\mathcal{F}_{SDN}$ shares the same U-Net architecture as $\mathcal{F}_{SSN}$, which keeps a larger spatial resolution at its bottleneck than traditional autoencoder networks, and includes skip connections to capture more mask details.
%
%
%%\begin{figure}[t!]
%%    \centering 
%%    \includegraphics[width=0.45\textwidth]{fig/b_shape_aug2.png}
%%    \caption{Examples of our shape augmentation operations.} 
%%    \label{fig:shape_aug}
%%%    \vspace{-0.3cm}
%%\end{figure}
%
%
%\subsection{Model Learning}\label{sec:learning}
%We learn our model in an EM framework. Below we sequentially introduce an initialization step of our system by training our SSN on weak labels, then the training of our self-taught SDN, and finally our iterative learning with EM.
%
%
%\subsubsection{Initialization}\label{sec:bootstrap}
%Given input images and corresponding weak labels, we initialize our system by training our SSN with weighted cross entropy on labeled pixels: $\mathcal{L}_{SSN} (\Theta) = \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y})$. Due to highly imbalanced foreground and background in weak labels, we adopt an auto-weighting strategy in our loss function, to systematically balance labeled foreground and background into 1:1 for each volume. %Empirical experiments show that this auto-weighting strategy is more stable and has better generalization to different datasets, compared to a fixed weighting ratio.
%After the initialization, our SSN is able to provide initial segmentation masks, which serve as important training signals for our SDN.
%
%%\begin{align}
%%\mathcal{L}_{SSN} (\Theta) = \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y})
%%\end{align}
%
%
%
%\subsubsection{Training Shape Denoising Network}
%For denoising model learning, DAE~\cite{vincent2010stacked} applies artificial random noise to input images and reconstructs corresponding clean targets. AAE~\cite{Sundermeyer_2018_ECCV} proposed a domain randomization technique to mimic environment and sensor variations captured by real cameras. They utilize this technique to augment their input synthetic images and reconstruct images invariant to irrelevant factors other than orientation.
%
%In our case, one main difference is that we have no full mask annotations for self-supervised learning. One possible solution is to use digital synthetic shape models, which might have a large domain gap to different datasets, especially for masks in medical segmentation. To avoid such domain gap, we propose to extract a self-taught shape prior with our weakly supervised segmentation model SSN. The underlying assumption is that our trained SSN is able to generate masks with above-average accuracy for some instances in training split, and thus can supply those masks with better shape quality to help refine other masks. To this end, we first compute the average foreground probability of each $\mathbf{P}_s$ in training split as the confidence of each predicted mask $\mathbf{M}_s$, and then take the mask with the highest confidence as our self-taught shape prior $\mathbf{M}^*$ to train our SDN.
%
%To train our SDN for noise removal, instead of adding general random noise to input masks, we specifically design our noise augmentation, based on the observation of errors produced by our initially trained SSN.
%
%We summarize the typical error modes of initial segmentation masks in three categories: (1) Over-smoothed regions;
%(2) Wrongly attached blobs;
%(3) Over-prediction of foreground regions beyond starting and ending slices. 
%These errors occur mainly because there is no clear boundary supervision in weak labels, while the neighboring objects might share similar intensity or texture to the target object.
%
%To equip our SDN with capability to deal with the aforementioned errors, we design three corresponding noise augmentation operations: 
%(1) Closing; 
%(2) Dilation; 
%(3) Extension of marginal slices. 
%Examples are as shown in Fig.~\ref{fig:model} (b). %~\ref{fig:shape_aug}.
%Moreover, we apply spatial transformation including rotation, translation, and scaling, to capture rich position and size variations. Detailed information is in Sec.~\ref{sec:implementation}.
%
%We augment our self-taught shape prior mask $\mathbf{M}^*$ into $\mathbf{\hat{M}}^*$, and train our SDN to reconstruct the clean mask $\mathbf{M}^*$ with cross entropy loss: $\mathcal{L}_{SDN} (\Omega) = \mathcal{L}_{ce} (\mathcal{F}_{SDN} (\mathbf{\hat{M}}^*; \Omega), \mathbf{M}^*)$.
%
%
%
%%\begin{align}
%%%\mathbf{\hat{P}}_d &= \mathcal{F}_{SDN} (\mathbf{\hat{M}}_s; \Omega)\\
%%\mathcal{L}_{SDN} (\Omega) &= \mathcal{L}_{ce} (\mathcal{F}_{SDN} (\mathbf{\hat{M}}^*; \Omega), \mathbf{M}^*)
%%\end{align}
%
%
%\subsubsection{Learning with EM}
%With initialized SSN and SDN, we further improve our model by an iterative EM learning that treats the full mask as the latent variable and finetunes the SSN model.
%
%
%\paragraph{E-step}
%In E-step, we generate pseudo masks from predictions of both SSN and SDN. Specifically, we first compute the intersection of the segmented mask from SSN and the shape-refined mask from SDN, and then apply a simple uncertainty filtering mechanism, based on the confidence of each pixel in $\mathbf{P}_s$ from the semantic segmentation output. We compute pseudo label masks for foreground ($\mathbf{Y}_{fg}$) and background ($\mathbf{Y}_{bg}$) independently as below, where $\sigma_{fg}$ and $\sigma_{bg}$ are the respective uncertainty threshold and further explained in Appendix. The final pseudo label $\mathbf{Y}_p$ combines $\mathbf{Y}_{fg}$ and $\mathbf{Y}_{bg}$, with unlabeled pixels set to $u$.
%\begin{align}\label{eq:E-step}
%\mathbf{Y}_{fg} &= \mathbf{M}_s * \mathbf{M}_d * \mathds{1} (\mathbf{P}_s > \sigma_{fg}), \quad
%\mathbf{Y}_{bg} = (1-\mathbf{M}_s) * (1-\mathbf{M}_d) * \mathds{1} (\mathbf{P}_s < \sigma_{bg})\\
%%\end{align}
%%The final pseudo label $\mathbf{Y}_p$ combines $\mathbf{Y}_{fg}$ and $\mathbf{Y}_{bg}$, with unlabeled pixels set to $u$.
%%\begin{align}
%\mathbf{Y}_p &= \mathds{1}(\mathbf{Y}_{fg} = 1) + u * \mathds{1}(\mathbf{Y}_{fg} = 0) * \mathds{1}(\mathbf{Y}_{bg} = 0)
%\end{align}
%
%
%
%\paragraph{M-step}
%In M-step, we update the parameters $\Theta$ of our SSN by minimizing a weighted cross entropy loss of segmentation probability $\mathbf{P}_s$ w.r.t. both the original weak labels and the generated pseudo labels, using the same auto-weighting strategy as in Sec~\ref{sec:bootstrap}: 
%\begin{align}
%\mathcal{L}_{M} (\Theta) &= \lambda_w \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}) + \lambda_p \mathcal{L}_{wce} (\mathbf{P}_s, \mathbf{Y}_p)
%%\mathcal{L}_{M} (\Theta) &= \frac{1}{N}\sum_{n=1}^N (\mathcal{L}_{wce}^n (\mathbf{P}_s, \mathbf{Y}^n) + \mathcal{L}_{wce}^n (\mathbf{P}_s, \mathbf{Y}_p))
%%\Theta &= \arg \min \mathcal{L}_{M} (\Theta)
%\end{align}
%where 
%$\lambda_w$ and $\lambda_p$ are the corresponding loss weights. 
%Note that we fix the parameters $\Omega$ of our SDN in our iterative learning with EM. Based on our empirical observation, updating $\Omega$ does not provide further improvement. This is mainly because our self-taught shape prior mask is of relatively good quality and our noise augmentation is strong enough to capture various error modes. In practice, we compute gradients and update $\Theta$ in each minibatch with SGD.
%\end{comment}
%
%
%


\begin{comment}
\section{Method draft}



% Problem setting
We consider weakly-supervised semantic segmentation, which aims to learn to segment semantic objects from training images with only weak labels. To this end, we adopt an expectation-maximization (EM) strategy that iteratively improves the unobserved segmentation masks until convergence. In this work, we focus on the single-class setting below.\footnote{It is straightforward to generalize our formulation to the multi-class setting by treating each semantic class separately.}

Formally, for each segmentation task $T$, we denote its dataset as $\mathcal{D}^l=\{\mathbf{X}_n, \mathbf{Y}_n\}^N_{n=1}$, where $\mathbf{X}_n\in \mathbb{R}^{H\times W\times D}$ are the input volumes, $\mathbf{Y}_n \in \{0, 1, 255\}^{H \times W \times D}$ are the weak label maps, and the value 255 stands for the unknown label. Our goal is to predict the unobserved semantic masks $\mathbf{Z}_n\in \{0, 1\}^{H\times W\times D}$ so that they match the ground-truth masks as closely as possible.

\subsection{Overview}
In this work, we propose a shape-aware EM learning framework for semantic segmentation. EM methods usually estimate intermediate pseudo masks and refine them iteratively. The main idea of our method is to leverage a 3D shape prior to refine the pseudo masks at the shape level and thereby avoid label error propagation. Furthermore, the shape prior helps improve prediction quality in the M-step.

To this end, we develop a shape denoising network that encodes the shape prior with a self-taught learning strategy, since no ground-truth labels are available for learning the shape prior. We also design an uncertainty-aware mechanism for the supervision of the M-step to obtain stable improvement. Our network contains three parts: a segmentation network that predicts semantic segmentations, a shape denoising network that refines the predicted masks at the shape level, and an uncertainty-aware mechanism that generates pseudo masks with uncertain regions. An overview of our model is presented in Fig.\_\_, and we explain the model details in the following subsections.



\subsection{Segmentation network}
The first module is a segmentation network that takes images as input and predicts semantic segmentation maps. Following \_\_, we adopt a plain 3D U-Net architecture, which contains an encoder that captures semantic context and a decoder that predicts the precise localization of semantic labels. Formally, given an input image $\mathbf{X}\in \mathbb{R}^{H\times W\times D}$, the segmentation network generates a confidence score map $\mathbf{M_{score}}\in [0,1]^{H\times W\times D}$ as follows:
\begin{align}
\mathbf{M_{score}} = f_\text{SEG}(\mathbf{X}; \Theta)
\end{align}
where $\Theta$ are parameters of our segmentation network.

Specifically, our three datasets have different input resolutions and therefore use different 3D U-Net versions. All versions share the same set of computation blocks but differ in the number of downsampling and upsampling stages: a higher resolution requires more downsampling stages to encode context. We also employ deep supervision to facilitate training. Details of the network configurations are given in Sec.~\ref{sec:net_config}.



\subsection{Shape denoising network}
% 1. aim: refine mask to revise its shape  
% 2. two challenges: lack for supervision, little samples (in shape space) (easy to overfit)
% 3. solution: bootstrap selection from confident predictions, shape augmentations.
The shape denoising network (SDN) aims to introduce shape-aware mask refinement for each object class. For clarity, we focus on incorporating a 3D shape prior in this paper, because most organs in medical images are 3D objects. In the absence of ground-truth labels, the two main challenges are the lack of faithful supervision and the lack of sufficient training samples. We propose a novel \textbf{self-taught shape prior learning mechanism} for the shape denoising module. As the source of supervision, we bootstrap from the initial confident predictions. We then design a set of shape error augmentations to enrich the training samples in shape space. In this way, our shape denoising network learns to eliminate implausible shape errors and generate improved shapes. 

Formally, the shape denoising module takes as input the predicted hard-label mask $\mathbf{\hat{M}}\in \{0,1\}^{H\times W\times D}$ and outputs a refined mask with improved shape, $\mathbf{M_{refine}}\in \{0,1\}^{H\times W\times D}$:
\begin{align}
    \mathbf{\hat{M}} &= \mathbb I(\mathbf{M_{score}} > 0.5)  \\
    \mathbf{M_{refine}} &= f_\text{SDN}(\mathbf{\hat{M}}; \Phi)
\end{align}
where $\mathbb I$ is the indicator function and $\Phi$ are the parameters of our shape denoising network. We use a U-Net architecture for the shape denoising network, consistent with the segmentation network. The two networks differ in function and input: the shape denoising network refines input masks at the shape level, while the segmentation network predicts semantic masks from input images.

% main idea: self-taught, boostrap, novelty
% 1. source of supervision: initial confident predictions
% 2. observation: high probability -> high confidence -> good quality
% 3. select a single case: sufficient because of subsequent augmentions (enhance generalization ability)
We then introduce the source of shape supervision. Instead of ground-truth labels, we seek pseudo labels of exceptional quality among the initial predictions $\mathbf{M_{score}}$ of the segmentation network, which is supervised by weak labels only at this stage. We observe that this initial supervision already yields some good-quality semantic masks, so we can bootstrap from these predicted labels. We also observe that the segmentation network usually predicts high probabilities and superior masks for easy cases; we therefore take the average probability over all voxels of the object class as the confidence measure. We denote the set of voxels with a probability greater than 0.5 as $FG_{score} = \{v_{i}: v_{i} \in \mathbf{M_{score}} \land v_{i}>0.5 \}$ and its cardinality as $k$; the confidence measure is then $\mathrm{measure} = \frac{1}{k} \sum_{v_{i}\in FG_{score}} v_{i}$. Specifically, we sort all sample predictions by this confidence measure and select a single prediction as our starting point. In practice, we select the most confident one as our target label. A single label is sufficient in our method, since we subsequently apply various shape augmentations to it to enhance generalization.
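
As a purely hypothetical illustration of this confidence measure (the numbers are invented for exposition and are not taken from any experiment), suppose a prediction contains exactly three voxels with probability above $0.5$, with values $0.9$, $0.8$ and $0.6$. Then $k = 3$ and
\begin{align*}
\frac{1}{k} \sum_{v_{i}\in FG_{score}} v_{i} = \frac{0.9 + 0.8 + 0.6}{3} \approx 0.77 ,
\end{align*}
whereas a prediction whose foreground voxels all lie close to $0.5$ would score near $0.5$ and would be ranked lower.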


% intro: shape error augmentation
% 1. shape augmentations: error mode, spatial transformations
% 2. error mode of weakly-supervised model - corresponding error augmentation
% 3. spatial transformation - generalization ability (imitate shape )
Next, we present the shape augmentations used for training the shape denoising network. Two groups of augmentations are used: shape error imitation and spatial transformation. To equip the shape denoising network with denoising and revising abilities, we imitate during training the shape errors that occur frequently in weakly-supervised segmentation. We visualize the predictions and summarize the error modes for each dataset, as listed in Table.\_\_. In addition, each dataset contains a class of objects with significant variance in scale, location and rotation. Hence, we apply spatial transformations to improve generalization.

We summarize frequent shape error modes in weakly-supervised medical object segmentation.
\begin{enumerate}
    \item \textit{Wrongly attached organs.} Due to the scarce supervision in border areas and the homogeneous features of adjacent organs, the segmentation network often predicts masks with wrongly attached organs, e.g., attaching the adjacent esophagus region to the target trachea.
    \item \textit{Blurry bifurcation.} The segmentation model tends to predict a single blob for regions that are originally separate, e.g., a blurry trachea bifurcation or aortic bifurcation.
    \item \textit{Mismatch of dataset bias.} Some datasets have unique biases in 3D shape, e.g., predicted trachea masks are usually longer than expert annotations.
\end{enumerate}
Accordingly, we propose augmentation operations that imitate these errors and biases. Note that these error augmentations are applied to the input masks only.
\begin{enumerate}
    \item \textit{Dilation operation.} We grow a number of attached areas near the border of the object by 3D morphological dilation.
    \item \textit{Closing operation.} We apply a 3D morphological closing operation to the whole object to generate blurry bifurcations.
    \item \textit{Other dataset bias.} For the trachea dataset, we apply a marginal extension along the axial axis to mimic the prediction tendency.
\end{enumerate}
Finally, to generalize this shape model, we employ spatial transformations to generate a variety of scales and rotations. Spatial transformations are applied to both the input and target masks.
\begin{enumerate}[resume]
    \item \textit{Spatial transformation.} 3D scaling and rotation are used with a specific probability.
\end{enumerate} 



\subsection{Model learning}
% 1. Initialization: segmentation model
% 2. Training the shape denoising model (adaptive: higher-quality input, higher-quality output )
% 3. EM strategy: uncertainty reweighting (in loss term), iterate each iteration.
We now introduce the weakly-supervised learning strategy of our EM framework. As a bootstrap, we train the segmentation network from scratch with only weak labels to obtain initial parameters and preliminary predictions. We then adopt a self-taught learning strategy to train the shape denoising network, which learns to adaptively revise the predicted mask into one with improved shape. Lastly, we alternate between the expectation step (E-step) and the maximization step (M-step) until convergence.

\subsubsection{Bootstrap}
For the dataset $\mathcal{D}^l=\{\mathbf{X}_n, \mathbf{Y}_n\}^N_{n=1}$, we first use the weak labels $\{\mathbf{Y}_n\}^N_{n=1}$ to train an initial segmentation model with a weighted cross-entropy loss:
\begin{align}
    \mathcal{L}_{SEG}(\Theta) = \frac{1}{N}\sum_{i=1}^{N} l_{wce}(f_{SEG}(\mathbf{X}_{i} ; \Theta),& \mathbf{Y}_{i}) \label{eq:supervised_loss}
\end{align}
where $\mathcal{L}_{SEG}$ denotes the segmentation loss, $l_{wce}$ is the weighted cross-entropy loss, $\mathbf{X}_{i}$ is the input image, and $\mathbf{Y}_{i}$ is the corresponding weak label. Only voxels with known labels are supervised. Class weights are computed from the class ratios to mitigate the highly imbalanced labels.
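
For concreteness, one plausible form of $l_{wce}$ is sketched below. It is stated as an assumption rather than as the exact definition used in our implementation: the set $\mathcal{V}_i$, the per-voxel shorthand $p_v$, $y_v$, and the inverse-frequency weights $w_c$ are notation introduced only for this illustration. Let $\mathcal{V}_i = \{v : \mathbf{Y}_i(v) \neq 255\}$ be the labeled voxels, $p_v$ the predicted foreground probability at voxel $v$, $y_v = \mathbf{Y}_i(v)$, and $n_c$ the number of labeled voxels of class $c \in \{0, 1\}$. Then
\begin{align*}
l_{wce}(f_{SEG}(\mathbf{X}_{i} ; \Theta), \mathbf{Y}_{i}) = -\frac{1}{|\mathcal{V}_i|} \sum_{v \in \mathcal{V}_i} \Big( w_1 \, y_v \log p_v + w_0 \, (1 - y_v) \log (1 - p_v) \Big), \qquad w_c \propto \frac{1}{n_c} ,
\end{align*}
so that unlabeled voxels contribute nothing to the loss and the rarer class receives the larger weight.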

% 1. implicit label propagation (3D context, feature level)
% 2. defective predictions, not aware of global shape
Under the supervision of sparse weak annotations, the segmentation network achieves implicit label propagation, owing to the semantic context extracted by 3D convolutions and downsampling. However, the initial segmentation network usually predicts defective segmentation maps and is not aware of the global shape. To avoid error propagation and to revise the global shape during the EM process, we train the shape denoising network.

\subsubsection{Self-taught shape denoising network}
% 1. augmentated input and target, ce
% 2. condition: predict good on easy pattern
% 3. adaptive (higher-quality input, higher-quality output)
In the training phase of the shape denoising network, the input mask $\mathbf{\hat{M}}$ is augmented with shape errors, and then both the input mask $\mathbf{\hat{M}}$ and the target mask $\mathbf{M_{refine}}$ are augmented with consistent spatial transformations. We use the cross-entropy loss as the loss function:
\begin{align}
    \mathcal{L}_{SDN}(\Phi) = \frac{1}{K}\sum_{i=1}^{K} l_{ce}(f_{SDN}(\mathbf{\hat{M}}_{i} ; \Phi),& \mathbf{M_{refine}}_{i}) \label{eq:sdn_loss}
\end{align}
where $\mathcal{L}_{SDN}$ denotes the shape denoising loss, $K$ is the total size of the training set after augmentation, and $l_{ce}$ is the cross-entropy loss.


\begin{algorithm}
    \caption{Shape augmentation in training phase}
    \begin{algorithmic}
    \State \textbf{Input:} One single label.
    \State \textbf{Output:} Augmented label set.
    \State \textbf{Procedure:}
        \begin{enumerate}
            \item Create an empty label set.
            \item Keep a copy of the input mask as the target mask, and apply shape error augmentations to the input mask.
            \item Apply the same spatial transformations to both the input and target masks.
            \item Append the current label pair to the label set, and repeat the previous two steps until the label set reaches the given size.
            \item Return the label set.
        \end{enumerate}
    \end{algorithmic}
    \end{algorithm}

The shape denoising network works well provided that the initial segmentation network predicts well on easy patterns. The self-taught mechanism requires a good-quality prediction as the target label, and samples with easy patterns are usually chosen because of their high confidence scores. However, if the segmentation model predicts poorly on all samples, the selected target label is no longer accurate, and training subsequently fails to denoise and revise the global shape. In most cases, the given weak labels are enough to avoid this situation, and the self-taught shape denoising network is feasible.

Furthermore, we find that the trained shape denoising network adapts to inputs of diverse shape quality. In other words, it improves poor-quality inputs to some extent and also improves relatively good-quality inputs. In practice, it does not degrade shape quality, but once the input quality reaches an upper bound, it no longer has an effect.


\subsubsection{EM learning strategy}
% E: estimate the pseudo label: contains SN, SDN, uncertainty filter mechanism
% M: update parameters of SN
We propose a novel EM framework for weakly-supervised semantic segmentation. The E-step and M-step share the segmentation network and the shape denoising network. In the E-step, we estimate pseudo labels $\mathbf{M_{pseudo}}$ from the input images $\mathbf{X}$: passing them through the segmentation network and the shape denoising network yields refined pseudo labels, to which we then apply an uncertainty filter mechanism. In the M-step, we update the current estimate of the model parameters {$\Theta$} using these pseudo labels and the original weak labels.

To be clear, we update the parameters $\Phi$ of the shape denoising network only once, because of its adaptability and its expensive training cost. (We verify in the ablation study that updating it once does not hurt the final performance.) Formally, the E-step and M-step are alternated during model learning.\\

\textbf{E-step}: The posterior distribution of the latent variables (here, the pseudo labels) is estimated using the current parameters \{$\Theta, \Phi$\}. 
\begin{align}
    \mathbf{M_{score}} &= f_\text{SEG}(\mathbf{X}; \Theta^{t}) \label{segnet forward} \\
    \mathbf{\hat{M}} &= \mathbb I(\mathbf{M_{score}} > 0.5) \label{indication forward} \\
    \mathbf{M_{refine}} &= f_\text{SDN}(\mathbf{\hat{M}}; \Phi) \label{shape forward} \\
    \mathbf{M_{pseudo}} &= \mathrm{UncertaintyFilter}(\mathbf{M_{refine}}, \mathbf{M_{score}}) \label{uncertainty filter}
\end{align}

where $\mathbf{M_{score}}$ denotes the predicted segmentation map, $\mathbf{\hat{M}}$ is the hard-label prediction obtained by thresholding at 0.5 with the indicator function $\mathbb I$, $\mathbf{M_{refine}}$ is the semantic map refined by the shape denoising network, and $\mathbf{M_{pseudo}}$ is the estimated pseudo label with uncertain regions. $\Theta^{t}$ is the current parameter estimate of the segmentation model, and $\mathrm{UncertaintyFilter}(\cdot)$ is our uncertainty filter method.

\begin{algorithm}
    \caption{Uncertainty filter mechanism}
    \begin{algorithmic}
    \State \textbf{Input:} segmentation maps $\mathbf{M_{score}}$ and $\mathbf{M_{refine}}$, predefined filter ratio $r$.
    \State \textbf{Output:} pseudo label $\mathbf{M_{pseudo}}$.
    \State \textbf{Procedure:}
        \begin{enumerate}
            \item Threshold $\mathbf{M_{score}}$ and $\mathbf{M_{refine}}$ to binary labels.
            \item Set voxels whose labels are inconsistent between $\mathbf{M_{score}}$ and $\mathbf{M_{refine}}$ to the uncertain label (label id 255).
            \item Sort all voxels of $\mathbf{M_{score}}$ in order of increasing confidence and filter them out in this order until the ratio $r$ is reached. Filtered voxels are set to the uncertain label.
            \item Return the pseudo label that contains certain and uncertain labels.
        \end{enumerate}
    \end{algorithmic}
    \end{algorithm}
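
The filter above can be sketched voxel-wise as follows. This is one possible formalization under illustrative assumptions of ours, not a definition taken from the algorithm: we write $s_v = \mathbf{M_{score}}(v)$ and $m_v = \mathbf{M_{refine}}(v)$, take the confidence of voxel $v$ to be $c_v = \max(s_v, 1 - s_v)$, and let $\tau_r$ denote the $r$-quantile of $\{c_v\}$. Whether voxels already marked by the disagreement test count toward the ratio $r$ is left open here.
\begin{align*}
\mathbf{M_{pseudo}}(v) =
\begin{cases}
255, & \text{if } \mathbb I(s_v > 0.5) \neq \mathbb I(m_v > 0.5) \text{ or } c_v < \tau_r, \\
\mathbb I(m_v > 0.5), & \text{otherwise.}
\end{cases}
\end{align*}
In the second case the two thresholded maps agree, so either of them can supply the retained label.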

% 1. aim: filter uncertain labels
% 2. Input, output
% 3. insight: low probability -> low confidence | filter ratio: lots error
We design an uncertainty filter mechanism to handle pseudo labels with low confidence. This mechanism aims to filter uncertain labels out of the estimated pseudo labels. It takes as input $\mathbf{M_{score}} \in [0, 1]^{H\times W\times D}$ and $\mathbf{M_{refine}} \in [0, 1]^{H\times W\times D}$, filters out voxels with relatively low confidence until a predefined filter ratio is reached, and outputs new pseudo labels with uncertain regions, $\mathbf{M_{pseudo}} \in \{0, 1, 255\}^{H\times W\times D}$. It is based on the observation that the predicted probability map can serve as a confidence map: predictions with low probability are not reliable. Moreover, we filter a fixed ratio of all predictions to ensure trustworthy pseudo labels; alternatives to a predefined ratio can be explored in the future.
\\

\textbf{M-step}: We update the estimate of the parameters {$\Theta$} by maximizing the likelihood, i.e., minimizing the following loss:
\begin{align}
    &\Theta^{t+1} = \arg\min_{\Theta}  \mathcal{L}_{SEG}(\Theta)   \label{param update} \\
    \begin{split}
    \mathcal{L}_{SEG}(\Theta) = &\frac{1}{N}\sum_{i=1}^{N} \Big( l_{wce}(f_{SEG}(\mathbf{X}_{i} ; \Theta), \mathbf{Y}_{i}) \\
    &+ \alpha \, l_{wce}(f_{SEG}(\mathbf{X}_{i} ; \Theta), \mathbf{M_{pseudo}}_{i} ) \Big) \label{eq:weak pseudo loss}
    \end{split}
\end{align}

where $\Theta^{t+1}$ denotes the updated parameters, and $\mathcal{L}_{SEG}(\Theta)$ consists of two loss terms, supervised by the weak labels $\mathbf{Y}_{i}$ and the pseudo labels $\mathbf{M_{pseudo}}_{i}$ respectively. The two sources of supervision are balanced by a hyper-parameter $\alpha$. 
\\




    \begin{algorithm}
        \caption{Iterative EM process}
        \begin{algorithmic}
        \State \textbf{Input:} Image $\mathbf{X}$, weak label $\mathbf{Y}$, initial parameters $\Theta$, $\Phi$
        % \Repeat
        \State \textbf{E-step:} Estimate pseudo labels $\mathbf{M_{pseudo}}$.
            \begin{align*}     
                &\mathbf{M_{score}} = f_\text{SEG}(\mathbf{X}; \Theta^{t}) \\
                &\mathbf{\hat{M}} = \mathbb I(\mathbf{M_{score}} > 0.5) \\
                &\mathbf{M_{refine}} = f_\text{SDN}(\mathbf{\hat{M}}; \Phi) \\
                &\mathbf{M_{pseudo}} = \mathrm{UncertaintyFilter}(\mathbf{M_{refine}}, \mathbf{M_{score}})
            \end{align*}
        \State \textbf{M-step:} Update model parameters $\Theta$ with SGD.
            \begin{flalign*}
                \Theta^{t+1} &= \arg\min_{\Theta}  \mathcal{L}_{SEG}(\Theta) \\
                % \begin{split}
                \mathcal{L}_{SEG}(\Theta) &= \frac{1}{N}\sum_{i=1}^{N} \Big( l_{wce}(f_{SEG}(\mathbf{X}_{i} ; \Theta), \mathbf{Y}_{i}) \\
                &\quad + \alpha \, l_{wce}(f_{SEG}(\mathbf{X}_{i} ; \Theta), \mathbf{M_{pseudo}}_{i} ) \Big) 
                % \end{split}
            \end{flalign*}
        % \Until reaching the convergence.
        \State \textbf{Repeat:} Alternate the E-step and M-step until convergence.
        \end{algorithmic}
        \end{algorithm}


% 1. iterative guarantee: shape denosing and uncertainty filter
% 2. we allow iteratively alternating
\textbf{Iterative EM}: Existing EM frameworks in weakly-supervised segmentation usually iterate only once, because label errors propagating from the E-step to the M-step hurt segmentation performance over multiple iterations. Two mechanisms block error propagation in our model: the shape denoising network and the uncertainty filter mechanism. The former removes estimation errors using the shape prior, while the latter filters out unreliable voxels according to the predicted confidence maps. In this manner, our model can alternate between the E-step and the M-step iteratively, so that the pseudo labels and the model are updated until convergence.

For the task of weakly-supervised segmentation, we first train an initial segmentation network under the supervision of weak labels. At the same time, we obtain the shape denoising model by applying the self-taught learning approach described above. Afterwards, in each E-step, uncertainty-aware pseudo labels are estimated by passing the input images through the segmentation network, the shape denoising network and the uncertainty filter. In each M-step, the parameters of the segmentation network are updated under the supervision of the pseudo labels and the weak labels. We alternate the E-step and M-step for multiple iterations.


%where an weighting modification is applied in pixel-level supervisions.
\end{comment}