%\vspace{-0.3cm}
\section{Experiments}
We evaluate our method on three public benchmarks covering organs with distinctive shape properties: the trachea in SegTHOR~\cite{trullo2019multiorgan}, the left atrium in the 2018 Atrial Segmentation Challenge, and the prostate in PROMISE12~\cite{litjens2014evaluation}. On each dataset, we compare with state-of-the-art methods that use different weak annotations. 
Due to the page limit, we present results on PROMISE12 in Appendix~\ref{appendix:prostate}.

Below, we first introduce the datasets in Sec.~\ref{sec:datasets} and the implementation details in Sec.~\ref{sec:implementation}. We then compare our results with other methods in Sec.~\ref{sec:results}, followed by a comprehensive ablation study in Sec.~\ref{sec:ablation}. %, to illustrate the effectiveness of each component of our design. 
Moreover, we further analyze our SDN to better understand the shape denoising mechanism in Appendix~\ref{appendix:analysis}, and discuss failure cases and potential future work in Appendix~\ref{appendix:future}.

%To evalutate our method, we conduct experiments on three public datasets and compare against the state-of-the-art approaches. Three representative organs are used here, which are prostate, trachea and left atrium respectively. 

%Below we first introduce details of three datasets in Sec.\ref{sec:datasets}, then present implementation details including network configuration and training in Sec.\ref{sec:implementation}. After that we report comparison of quantitative results in Sec.\ref{sec:results}, followed by visualization analysis in Sec.\ref{sec:visualization}. Finally, we show experiments of ablation study to validate the effectiveness of each component in Sec.\ref{sec:ablation}. 

\subsection{Datasets} \label{sec:datasets}

\paragraph{SegTHOR Challenge}
The SegTHOR challenge\footnote{https://competitions.codalab.org/competitions/21145}~\cite{trullo2019multiorgan} 
consists of 60 thoracic CT scans from patients diagnosed with lung cancer or Hodgkin's lymphoma. 
It is an isotropic dataset in which every scan has a size of $512\times512\times N$, with $N$ ranging from 150 to 284 slices. The in-plane spacing varies from 0.90 mm to 1.37 mm and the z-spacing from 2 mm to 3.7 mm. We split the 40 publicly available scans into 30 for training and 10 for validation, and evaluate on the 20 test scans through the official challenge page. The dataset covers four organs: heart, aorta, trachea, and esophagus. We conduct experiments on the trachea because its shape is both challenging and representative.
%[1. The shapes of heart, aorta, and esophagus are not representative 2. Distinctive properties of the trachea shape: bifurcation, location, size variance, tubular structure]

\paragraph{Left Atrium Dataset}
The Left Atrium (LA) dataset is from the 2018 Atrial Segmentation Challenge\footnote{http://atriaseg2018.cardiacatlas.org/}. It contains 100 pairs of 3D gadolinium-enhanced MR imaging scans (GE-MRIs) and LA segmentation masks. All scans are isotropic with a spacing of $0.625\times0.625\times0.625~\mathrm{mm}^{3}$. The in-plane resolution of the MRIs varies across patients, while all MRIs have exactly 88 slices along the z-axis. We split the 100 scans into 60 for training, 20 for validation, and 20 for testing.

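% [Prostate: the shape is relatively simple, but the boundary is hard to delineate; differing acquisition protocols add some challenge.]
% [Trachea: note: originally 40 training cases and 20 online test cases. If time permits, submit once at the end for the final test Dice; in general, because of the limit on the number of submissions, we cannot list all test Dice scores.]
% [LA characteristics: 1. more complex shape: multiple bifurcations, size variance 2. shapes within the dataset are more diverse, not one uniform shape]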





For all datasets, we use the Dice coefficient (Dice), a standard evaluation metric.
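
For reference, we recall the usual definition of this metric. The symbols $P$ (the set of predicted foreground voxels) and $G$ (the set of ground-truth foreground voxels) are introduced here only for illustration:
\[
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},
\]
which ranges from 0 (no overlap) to 1 (perfect overlap) and is reported in percent in our tables.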






\subsection{Implementation Details} \label{sec:implementation}


% ratio, scribble: line width, major axis, bbox distance
%In order to simulate the situations where only weak annotations are available, we generate weak annotations from pixel-level annotations for all datasets, with increasing labeled ratios (10\%, 30\%, 50\%) in slice level. Our weak annotations consist of one inner scribble and outer bounding box. For each labeled slice, one inner scribble with line width 3 pixel is marked, which is roughly a major axis for the foreground region. We also generate a foreground bounding box with a random distance of 10-20 pixels to real boundaries. For those slice that contains multiple connected region, we mark one region randomly. Note that we mark the head and tail slices of target object for each case.



%\paragraph{Model Training}

To fit 3D images into our network, we crop the volumes in a preprocessing step guided by the weak annotations. Specifically, we first resample all image volumes to the same spacing, then align all training volumes by their centers and pad them to the same size, and finally crop from every volume a region 1.2 times the size of the union cuboid of the weakly annotated pixels. 
All pixels outside our loose bounding boxes, as well as all slices beyond the starting and ending slices, are labeled as background.
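
As an illustrative formalization of the cropping step (a sketch under one reading of the description above, with notation introduced only here), let $[a_k, b_k]$ denote the extent along axis $k$ of the union of weakly annotated pixels over all aligned volumes. Assuming that the factor 1.2 is applied per axis and that the crop is centered on this union cuboid, the crop window along axis $k$ would be
\[
\left[\, c_k - 0.6\,\ell_k,\; c_k + 0.6\,\ell_k \,\right], \qquad c_k = \frac{a_k + b_k}{2}, \quad \ell_k = b_k - a_k,
\]
so that its length is $1.2\,\ell_k$.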

Our noise augmentation for training SDN consists of the following operations:
(1) Closing. We apply standard morphological closing, i.e., dilation followed by erosion. Closing produces over-smoothed masks, which are a common error for the trachea and the left atrium.
(2) Dilation. We first randomly choose a center near the mask boundary, and then apply dilation there with a random number of iterations. For anisotropic datasets such as prostate, we adopt 2D dilation, while for isotropic datasets we use 3D dilation.
(3) Extension of marginal slices. We elongate masks by simply copying a few starting or ending slices along the z-axis.
In addition, random spatial transformations are applied by default with a probability of 0.2 to enrich shape variations.
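
For reference, the closing in item (1) follows the standard definition from mathematical morphology. Writing $X$ for a binary mask and $B$ for a structuring element (both symbols are introduced here for illustration only, and no particular choice of $B$ is implied),
\[
X \bullet B = (X \oplus B) \ominus B,
\]
where $\oplus$ and $\ominus$ denote dilation and erosion. Since $X \subseteq X \bullet B$, closing can only add foreground: it fills narrow gaps and concavities, which mimics the over-smoothed masks mentioned above.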
%For spatial transformation, we adopt random scaling of range (0.7, 1.4), rotation angles of (-pi, pi) along the axial axis for anisotropic datasets while (-pi/6, pi/6) along all axes for isotropic datasets, with probability of 0.2.


%Following nnU-Net~\cite{isensee2019automated}, we utilize the same image augmentation and deep supervision for model training, with batch size of 2. 
%To train our model, we use the SGD optimizer. 
%For bootstrapping, we train SSN with initial learning rate of 1e-2 and decay it to 1e-3 in "poly" learning policy, for 200 epochs.
%To train SDN, we use constant learning rate of 1e-2 for 100 epochs.
%In iterative learning with EM, we use slightly different parameters for different datasets. Take trachea for example, we set loss weights $\lambda_w$ and $\lambda_p$ to 1 and 100, and train our model for a maximum of another 300 epochs with learning rate of 1e-3.
%uncertainty threshold

More detailed hyper-parameters and training settings are included in Appendix~\ref{appendix:training}.


%Due to large resolution of 3D medical images, we use a mini-batch of 2 images. Next we introduce details in phases of bootstrap, shape model training and EM alternating.

% Seg: aug, loss, SGD, lr, epoch
%During training the segmentation network in bootstrap phase, we employ a set of data augmentations referring to \ref{nnUNet}. The model is trained by the weighted cross-entropy loss, with SGD optimizer and initial learning rate 1e-2. The learning rate varies in a poly decay style. We train it for 200 epochs.

% SDN: noise aug, loss, ...
%For the shape denosing network, we take shape augmentations including shape error augmentations and spatial transformations. Table.\_\_ lists shape augmentation details. Specifically, dilation opetation is used in all datasets, we randomly choose a center near input boundaries, and dilate it with a given iterations. Closing operations are applied in trachea and left atrium datasets, we process the whole input with a given closing iterations. We employ marginal extension in the trachea dataset, for which we elongate marginal slices with copied last few slices. For spatial transformations, we apply a scaling and rotation operations, both with a probability of 0.2. We employ the cross-entropy loss with SGD optimizer, a constant learning rate 1e-2 and total 100 epochs. 

% EM
%When alternating EM steps, we take a combined loss of weak-label and pseudo-label supervisions, where we adopt SGD optimizer and learning rates \{1e-2, 1e-3, 1e-3\} for prostate, trachea and left atrium respectively. We iterate E-step and M-step per mini-batch, and train it with a maximum epoch 300.




\begin{table}[t!]
	\centering
	\caption{Quantitative results on the test splits of trachea and left atrium. All presented numbers are in Dice [\%]. `--' denotes that BoxPrior failed to predict any foreground.}
	\label{tab:test_result}        
	\resizebox{\textwidth}{!}{
		\begin{tabular}{c|c|c c c c |c c c c}
			\toprule
			\multirow{2}{*}{Method} & \multirow{2}{*}{Annotation} & \multicolumn{4}{c|}{Trachea (Test)} & \multicolumn{4}{c}{Left Atrium (Test)}    \\ 
			&                        & 100\% & 50\% & 30\% & 10\%  & 100\% & 50\% & 30\% & 10\%                                         \\ \midrule
			nnU-Net~\cite{isensee2019automated}     & Full label        & \multicolumn{4}{c|}{89.74}           & \multicolumn{4}{c}{92.63} \\ \cmidrule{1-10}
			BoxPrior~\cite{kervadec2020bounding}    & Box       & 79.82 & 48.78 & --    & --    & 83.93 & 83.51 & 81.86 & --    \\
			KernelCut~\cite{tang2018regularized}    & Scribble* & 84.39 & 83.44 & 82.77 & 67.78 & 78.97 & 76.71 & 74.42 & 64.93 \\
			Ours                                    & Scribble* & 84.61 & 83.88 & 83.37 & 81.82 & 85.61 & 84.11 & 83.26 & 83.11 \\
			KernelCut~\cite{tang2018regularized}    & Hybrid    & 84.74 & 83.55 & 83.38 & 76.43 & 77.54 & 76.72 & 73.64 & 67.27 \\
			Ours                                    & Hybrid    & \textbf{85.54} & \textbf{83.97} & \textbf{83.78} & \textbf{83.19} & \textbf{86.31} & \textbf{86.25} & \textbf{83.81} & \textbf{83.41} \\
			\bottomrule
		\end{tabular}
	}
%\vspace{-0.3cm}
\end{table}

%\subsection{Visualization Analysis} \label{sec:visualization}
% \begin{enumerate}
%     \item Our method generate segmentation with clean and complete shapes, which are able to handle various 3D shapes.
%     \item For trachea, remove outer noisy region, and attached adjacent false-predcited organs.
%     \item For prostate, denoise and recover 3D continuity.
%     \item For left atrium, accurate bifurations and boundaries, better global shape.
% \end{enumerate}
% Figure: visualization





\subsection{Results} \label{sec:results}

% \begin{enumerate}
%     \item We show quantitative results on three datasets.
%     \item For each dataset, we decrease weak label ratio from 100\% to 50\%, 30\% and 10\%, to validate label efficiencies.
%     \item We compare with SOTA methods: Kernelcut, BoxPrior, in the same label cost.
%     \item Fixing our EM strategy, we compare hybrid label and scribble label
% \end{enumerate}

    % points
% \begin{enumerate}
%     \item Our label + Our EM strategy achieves best performance for three datasets, and close to the upper bound. (effective method)
%     \item Our label + Our EM strategy works for different segmentation tasks of medical organs, validate its generalization ablility.
%     \item Hybrid label style performs comparable or better than scribble label style.
%     \item Our Em strategy outperforms SOTAs
%     \item In our mehod, label ratio of 30\% is close to saturate for model training, and achieves high performance.
%     \item As the label ratio decreases, our method drop less, and be close to the upper bound. (label efficiency)
% \end{enumerate}

We compare our method, trained with our proposed annotation, against state-of-the-art methods that use scribble or box labels at the same cost. 
Table~\ref{tab:test_result} reports quantitative results for labeled foreground slice ratios ranging from 100\% to 10\%.
For a fair comparison, all annotation types share the same labeled slices in each setting. 
Scribble* uses the long axis of our hybrid label as the foreground annotation and our loose box edges as background, whereas our hybrid label additionally treats the region outside the boxes as background, thereby encoding more shape context. 
We consider the per-slice cost of hybrid, scribble, and box annotations to be roughly the same, and justify this in Appendix~\ref{appendix:weak_label}.

%The quantitative results in Table.~\ref{tab:final_result} show that 
Ours+Hybrid consistently outperforms KernelCut~\cite{tang2018regularized} and BoxPrior~\cite{kervadec2020bounding}. The margin is largest in the 10\% labeled-slice setting, where we gain 15.41\% Dice on trachea and 18.48\% on left atrium over KernelCut with Scribble*, while BoxPrior fails to predict any foreground due to extreme imbalance in its regularization loss terms. Our method is robust across label densities, and its performance decreases only mildly with fewer labeled slices, which verifies that our model can exploit feature correlation and the self-taught shape prior to fill in missing labels. Moreover, KernelCut does not perform well on data without clear foreground boundaries or a distinct neighborhood, such as the left atrium, and BoxPrior fails to handle small and distant multi-connected regions, such as the trachea, whereas our method is robust to all of these challenges.
%Image examples are shown in Fig.~\ref{fig:all_labels}.
%We also report performance on test split of Left Atrium in Table.~\ref{tab:test_result}, showing that our method has good generalization ability. We cannot show test results on other two datasets as their evaluations for test are not available.
We also report the results of Ours+Scribble* and KernelCut+Hybrid, which show that our method still outperforms KernelCut when given the same label, and that the shape context encoded by the hybrid label further boosts performance. 
Qualitative results are shown in Appendix~\ref{appendix:qualitative}.
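
As a sanity check on the gaps quoted above, the 10\% columns of Table~\ref{tab:test_result} give, for Ours+Hybrid against KernelCut with Scribble*,
\[
83.19 - 67.78 = 15.41 \ (\mathrm{trachea}), \qquad 83.41 - 64.93 = 18.48 \ (\mathrm{left\ atrium}),
\]
in absolute Dice points. Against KernelCut with the hybrid label, the corresponding gaps are $83.19 - 76.43 = 6.76$ and $83.41 - 67.27 = 16.14$.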

%We compare of our method with the previous state-of-the-art methods, and summarize quantitative results on three datasets in Table.\ref{tab:final_result}. We first show the upper-bound performance achieved by a fully-supervised network, followed by performance comparisons of various methods, we also present results in different label ratios. For each dataset, we decrease labeled ratio of weak annotations from 100\% to 50\%, 30\% and 10\%, to validate label efficiencies. 
% a. the same cost, our outperform SOTAs
%Under the same label costs of 100\% ratio, we compare our proposed method with state-of-the-arts methods Kernelcut\cite{tang2018regularized} and BoxPrior\cite{kervadec2020bounding}. Among the weakly-supervised methods, BoxPrior fails on the trachea and left atrium datasets, indicating its generalization limitations, while our method achieves the top performance and significantly outperform Kernelcut (\_\_\% on trachea, \_\_\% on prostate, \_\_\% on left atrium). 
% b. our method decrease from 100\%->10\%, drop less
%We also consider a more challenging setting in which we decrease label ratios. As the label ratio decreases, our method outperforms Kernelcut with a large margin, for example of ratio 30\%, +\_\_\% on trachea, +\_\_\% on prostate, +\_\_\% on left atrium. Meanwhile, our method drops less and remains high performances, validating its robustness in sparse annotations.
%\textbf{To write: interpretation of our method (effect)}
%% c. Hybrid comparable or better than scribble label: scribble here is indeed the upper bound of annotations
%\textbf{To write: hybrid vs. scribble labels}
%\textbf{To write: discussions on each dataset}


%To better understand our self-taught shape prior in EM framework, we visualize predicted segmentation masks in Figure.\_\_. out model can cope with various 3D shapes in diverse medical domains. 
%\\ \textbf{To write: the whole effect} 
%\\ \textbf{To write: comparisons on 3 datasets respectively} 
%\\ \textbf{Todo: PLOT Corresponding 3D visualizations!}


%\begin{table}[t!]
%	\centering
%	\resizebox{\textwidth}{!}{
%		\begin{tabular}{c c c c c c c || c c c c c c}
%			\toprule
%			\multirow{2}{*}{Method} & \multirow{2}{*}{Shape prior} & \multirow{2}{*}{EM}  & \multicolumn{3}{c}{Trachea} &\multirow{2}{*}{} &\multirow{2}{*}{}  & \multirow{2}{*}{Method} & \multirow{2}{*}{Annotation} & \multicolumn{3}{c}{Trachea} \\ %\cline{4-6}
%			&                       &              & 50\% & 30\% & 10\%         &   &   &     &                        & 50\% & 30\% & 10\%                    \\ \midrule
%			Ours      & $\checkmark$  & $\checkmark$      & \textbf{83.37} & \textbf{83.59} & \textbf{83.57} 
%			&   &   &Ours      & Hybrid      & 83.37 & \textbf{83.59} & \textbf{83.57} \\ \midrule
%			& CRF      & $\checkmark$      & 81.08 & 80.53 & 80.83 
%			&   &   & & Scribble*      & \textbf{83.60}  & 82.75 & 81.63 \\
%			& --            & $\checkmark$      & 76.48 & 77.30 & 77.82 
%			&   &   & & Scribble (20-50)      & 78.91 & 75.11 & 75.45 \\ %\hline  \hline
%			& $\checkmark$  & --                & 74.17 & 74.80 & 74.17 
%			&   &   & & Scribble (dilation)      & 78.86 & 77.99 & 76.60 \\ \\
%			Baseline      & --            & --                          & 69.17 & 68.39 & 62.50 
%			&   &   & & Box     & 82.95 & 81.84 & 79.59 \\
%			\bottomrule 
%			\multicolumn{7}{c}{(a)} & \multicolumn{6}{c}{(b)}\\
%		\end{tabular}
%	}
%	\caption{(a) Ablation study on our model components. We conduct experiments in a drop-one-out manner. CRF denotes replacing our SDN with DenseCRF. (b) Ablation study on annotations. All scribbles share the same foreground label as our hybrid label. Scribble* denotes taking our loose box edges as background, Scribble (20-50) also denotes adopting loose box edges but with larger distance of 20-50 pixels to its tight box, and Scribble (dilation) represents scribbles generated by ground truth foreground dilation.}
%	\label{tab:ablation_em}        
%	%\vspace{-0.3cm}
%\end{table}

\subsection{Ablation Study} \label{sec:ablation}
%In this subsection, we conduct several detailed experimental studies to examine the effectiveness of our model components on trachea datasets.
%To illustrate the importance of each component in our method design, 
We conduct a comprehensive ablation study on our model in Table~\ref{tab:ablation_model} and on the hybrid label in Table~\ref{tab:ablation_annotation}. 
The ablation on loss terms is in Appendix~\ref{appendix:ablation_loss}.

\textbf{Shape prior.} SDN utilizes a self-taught shape prior for shape refinement. 
CRF denotes replacing SDN with DenseCRF, 
%Replacing SDN with DenseCRF 
which lowers performance by 2\%--3\%. Besides, DenseCRF post-processing on a 3D volume takes about 3.50\,s, while inference with our SDN needs only 0.02\,s.
Moreover, %\add{removing SDN} results in performance drop of 5\%-7\%.
removing SDN, i.e., removing $\mathbf{M}_d$ and $(1-\mathbf{M}_d)$ in Eq.~\ref{eq:E-step}, results in a performance drop of 4\%--5\%.

\textbf{Iterative.} Iterative learning incorporates the learned shape prior to iteratively improve our segmentation model. 
%feeds shape-refined mask back to SSN for better segmentation learning. 
Without iterative learning for refinement, model performance drops by about 9\%.
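
The drops quoted above can be read off Table~\ref{tab:ablation_model} as absolute Dice differences from the full model, at 50\%, 30\%, and 10\% labeled slices respectively:
\[
\begin{array}{lccc}
 & 50\% & 30\% & 10\% \\
\mathrm{SDN} \rightarrow \mathrm{DenseCRF} & 83.45 - 81.08 = 2.37 & 83.18 - 80.48 = 2.70 & 83.18 - 80.36 = 2.82 \\
\mathrm{without\ SDN} & 83.45 - 78.97 = 4.48 & 83.18 - 78.59 = 4.59 & 83.18 - 78.11 = 5.07 \\
\mathrm{without\ iterative\ learning} & 83.45 - 74.91 = 8.54 & 83.18 - 74.80 = 8.38 & 83.18 - 74.17 = 9.01
\end{array}
\]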

\textbf{Annotation.} We compare our hybrid label with different types of scribble and with box annotations. 
The hybrid label outperforms all other labels in every setting, with margins ranging from 0.40 (Scribble*, 50\%) to 7.73 (Scribble (20-50), 10\%) Dice points, showing that with more shape context, the hybrid label is more informative than scribble or box at the same cost. 
All scribbles share the same foreground label as our hybrid label. 
Scribble* takes our loose box edges as background, 
Scribble (20-50) also adopts loose box edges but at a larger distance of 20--50 pixels from the tight box, and Scribble (dilation) denotes scribbles generated by dilating the ground-truth foreground. See Appendix~\ref{appendix:weak_label} for more details. 
Scribble*, which derives directly from our hybrid label, achieves the best results among the scribble variants, suggesting that it is probably the most informative version of scribble at the same labeling cost. 
For Box, we first generate foreground and background labels on the labeled slices with GrabCut, and then use them as weak labels in our method. %Note that trachea has relatively clear boundary and large intensity difference from its neighborhood, which makes the initial GrabCut results cleaner than on other datasets.
%\add{To apply our method to bounding boxes only, we first generate initial segmentation proposals by GrabCut\cite{rother2004grabcut} then treat them as weak labels.} \quest{This part could be compressed.}
%\quest{This part could be compressed.}


\begin{table}[t]
	%	\hspace{0.01\textwidth}
	\begin{minipage}[t]{0.48\textwidth}
		\centering
		\caption{Ablation study on our model components. We conduct experiments in a drop-one-out manner.}
		\label{tab:ablation_model}
		\resizebox{\textwidth}{!}{
			\renewcommand{\arraystretch}{1.0}
			\begin{tabular}[t]{c c c|c c c}
				\toprule
				\multirow{2}{*}{Method} & \multirow{2}{*}{Shape prior} & \multirow{2}{*}{Iterative}  & \multicolumn{3}{c}{Trachea (Val)} \\ %\cline{4-6}
				&                       &              & 50\% & 30\% & 10\%                 \\ \midrule
				Ours      & $\checkmark$  & $\checkmark$      & \textbf{83.45} & \textbf{83.18} & \textbf{83.18} \\ %\\
				%            \midrule
				& CRF      & $\checkmark$      & 81.08 & 80.48 & 80.36 \\
				& --            & $\checkmark$      & 78.97 & 78.59 & 78.11 \\ %\hline
				& $\checkmark$  & --                & 74.91 & 74.80 & 74.17 \\
				Baseline      & --            & --                          & 69.17 & 68.39 & 62.50 \\
				\bottomrule 
			\end{tabular}
		}
	\end{minipage}
	\hspace{0.035\textwidth}
	\begin{minipage}[t]{0.48\textwidth}
		\centering
		\caption{Ablation study on annotations. We compare our hybrid label with different types of scribble and with box annotations.}
		\label{tab:ablation_annotation}
		\resizebox{\textwidth}{!}{
			\renewcommand{\arraystretch}{0.91}
			\begin{tabular}[t]{c c|c c c}
				\toprule
				\multirow{2}{*}{Method} & \multirow{2}{*}{Annotation} & \multicolumn{3}{c}{Trachea (Val)} \\ %\cline{4-6}
				&                        & 50\% & 30\% & 10\%                 \\ \midrule            
				Ours &   Hybrid  & \textbf{83.45} & \textbf{83.18} & \textbf{83.18} \\
				%            \midrule
				& Scribble*      & 83.05  & 82.75 & 81.63 \\
				& Scribble (20-50)      & 78.91 & 75.80 & 75.45 \\ %\hline  \hline
				& Scribble (dilation)      & 78.86 & 77.99 & 76.60 \\
                & Box                   & 82.25 & 81.70 & 80.60 \\
				% Baseline    & Scribble      & --            & --                          & 68.35 & 65.68 & 62.98 \\   \hline  \hline
				%             & Box     & 82.95 & 81.84 & 79.59 \\ 
				\bottomrule 
			\end{tabular}
		}
	\end{minipage}
%	\vspace{-0.5cm}
\end{table}
