% Supplementary parts
\clearpage
%\begin{figure}[t!]
%	\centering 
%	\includegraphics[width=0.8\textwidth]{fig/a_intro2.png}
%	\caption{(a) Examples of our proposed weak annotations on left atrium. For slices with more than one connected components, we only choose one to annotate. (b) Examples of our shape denoising results. From top to bottom: trachea, left atrium, and prostate.} 
%	\label{fig:advertise}
%	%      	\vspace{-0.3cm}
%\end{figure}

\input{tex/related.tex}

\section{Weak Annotation Strategy}\label{appendix:weak_label}
To simulate our weak annotation strategy, we derive our annotations from full masks. 
For the \textbf{foreground long axis}, we generate six lines through the center of mass of a mask, uniformly spaced at intersection angles of 30 degrees, and select the line with the longest extent inside the mask, leaving a distance of 5 pixels between the line end points and the mask boundary. For slices with multiple connected foreground components, we randomly select only one component to label. For our \textbf{loose bounding box}, we place each edge of the box 10--20 pixels away from the corresponding edge of the tight bounding box. The statistics in Table~\ref{tab:label_edge} show that a margin of 10--20 pixels for the loose box is reasonable in practice. We also generate scribbles and tight boxes for comparison. Note that our hybrid label can also be used as a scribble. Another generated scribble shares the same foreground label as our long axis, but uses a curve as the background label, obtained by dilating the foreground mask for 20--50 iterations. All scribbles are three pixels wide.
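
As a compact restatement of the loose-box rule above (the symbols are introduced here for illustration only), let $[x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}]$ be the tight bounding box of the labeled component. A loose box can then be written as
\[
[x_{\min} - \delta_1,\; x_{\max} + \delta_2] \times [y_{\min} - \delta_3,\; y_{\max} + \delta_4], \qquad \delta_k \in [10, 20],
\]
where each margin $\delta_k$ is measured in pixels. That the four margins are chosen independently, and how they are drawn within this range (e.g., uniformly at random), are assumptions of this sketch; in practice the box would presumably also be clipped to the image extent.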

We consider the annotation costs of the hybrid label, scribble, and tight bounding box for a slice to be roughly the same, as detailed below.
For our hybrid label, we require four points to label a slice, as described in Sec.~\ref{sec:implementation}. According to~\cite{bearman2016s}, workers take a median of 2.4s to click on the first instance of an object and 0.9s to click on every additional instance of an object class in PASCAL VOC 2012~\cite{everingham2010pascal}, which puts the cost of our hybrid label at roughly $2.4 + 0.9\times3 = 5.1$s.
For scribbles, \cite{bearman2016s} also reports that drawing a free-form scribble on a target class takes 10.9s for every class present.
In Table~\ref{tab:test_result} and Table~\ref{tab:final_result}, Scribble* denotes using the long axis of our hybrid label as the foreground annotation and our loose box edges as the background annotation. This makes the cost of Scribble* the same as that of our hybrid label. Moreover, we conduct further experiments on different scribbles in Sec.~\ref{sec:ablation}, showing that Scribble* is the most effective and informative type of scribble among the options we compared.
For tight bounding boxes, \cite{papadopoulos2017extreme} proposed extreme clicking for box annotation, which takes 7s to click on the four extreme points that define a box.
Based on the evidence above, we claim that our hybrid label, scribble, and tight bounding box share roughly the same labeling cost.



%\section{Results on Test Split of Trachea Dataset}
%We report quantitative results on the test split of trachea dataset in Table.~\ref{tab:test_result_tra}, by submitting predicted masks to SegTHOR challenge page\footnote{https://competitions.codalab.org/competitions/21145}. We were not able to report the results on the test split of trachea dataset in our main paper, because the official website took a random amount of time to return the results (mostly more than 20 hours) and the overall number of submissions are limited. 
%%In addition, it does not support evaluation for trachea alone, which took us some time to debug the submission format. 
%%Moreover, we could not submit our results for all settings in parallel in case any errors in format occurred, because the overall number of submissions are limited. 
%%Due to those issues above, we report the results on the test split of trachea dataset in our supplementary.
%
%The comparisons further verify the effectiveness and generalization capability of our method.


\begin{table}[t!]
	\centering
	\caption{Statistics of the 2D foreground mask size in the three datasets, measured as the lengths of all tight bounding box edges in pixels.}
	\label{tab:label_edge}        
	\resizebox{0.48\textwidth}{!}{
		\begin{tabular}{c |c c c}
			\toprule
			& Trachea     & Left Atrium   & Prostate \\
			\midrule
			mean ($\pm$std) & 24 ($\pm$14) & 82 ($\pm$41)   & 86 ($\pm$42) \\
			\bottomrule
		\end{tabular}
	}
\end{table}

\begin{table*}[t!]
	\centering
	\caption{Network configurations for trachea, left atrium and prostate datasets.}
	\label{tab:net_params}        
	\resizebox{\textwidth}{!}{
		\begin{tabular}{c| c c c}
			\toprule
			& Trachea & Left Atrium & Prostate  \\ \midrule
			Input spacing (mm)     & $2.50\times1.63\times1.63$ & $0.63\times1.33\times1.33$  & $2.20\times0.72\times0.72$  \\
			Input resolution & $128\times112\times128$ & $96\times160\times128$  & $40\times160\times192$  \\
			Pooling strides & $[[2,2,2], [2,2,2], [2,2,2], [2,2,2], [2,1,1]]$ & $[[2,2,2], [2,2,2], [2,2,2], [2,2,2]]$  & $[[1,2,2], [1,2,2], [2,2,2], [2,2,2], [2,2,2]]$  \\
			Convolution kernel sizes & $[[3,3,3], [3,3,3], [3,3,3], [3,3,3], [3,3,3], [3,1,1]]$ & $[[3,3,3], [3,3,3], [3,3,3], [3,3,3], [3,3,3]]$  & $[[1,3,3], [1,3,3], [3,3,3], [3,3,3], [3,3,3], [3,3,3]]$  \\
			\bottomrule
		\end{tabular}
	}
\end{table*}

\begin{figure*}[t!]
	\renewcommand\arraystretch{0.005}
	\begin{center}
		\resizebox{\textwidth}{!}{
			%\scalebox{0.72}[0.72]{
			\begin{tabular}{c}
				\raisebox{-0.9\height}{\includegraphics[width=\textwidth]{fig/s_net.png}}\\
				(a) Network architecture for trachea dataset.\\
				\raisebox{-0.9\height}{\includegraphics[width=\textwidth]{fig/s_net2.png}}\\
				(b) Network architecture for left atrium dataset.\\
				\raisebox{-0.9\height}{\includegraphics[width=\textwidth]{fig/s_net3.png}}\\
				(c) Network architecture for prostate dataset.\\
			\end{tabular}
		}
	\end{center}
	\caption{Network architectures. We adopt the same architecture for Semantic Segmentation Network (SSN) and Shape Denoising Network (SDN) but with independent parameters.}
	\label{fig:s_net_all}
\end{figure*}

\section{Network Architecture} \label{appendix:net_config}


%\paragraph{Network Architecture} \label{sec:net_config}
For our SSN, we utilize a 3D U-Net structure following nnU-Net~\cite{isensee2019automated}, consisting of an encoder and a decoder.
Our detailed network architectures for each dataset are shown in Fig.~\ref{fig:s_net_all}, and the corresponding network configurations are listed in Table~\ref{tab:net_params}.
Both the encoder and the decoder consist of four or five layers, depending on the input resolution. We define a computation block as three operations in sequence: {conv - instance norm - leaky ReLU}. Each layer contains two computation blocks, which preserve the spatial resolution of the features. We implement downsampling with strided convolutions and upsampling with transposed convolutions.
%The bottleneck feature spatial resolution is in between 4 to 8 in all three different dimensions. 
%We train the segmentation network from scratch for all datasets.
For our SDN, we adopt the same network architecture as our SSN with independent parameters.
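
As a consistency check on Table~\ref{tab:net_params}, the bottleneck resolution of each network can be derived by dividing the input resolution by the product of the pooling strides along each axis. This sketch assumes that every strided convolution reduces the feature size by exactly its stride and that the axis order of the strides matches that of the input resolution:
\begin{center}
	\begin{tabular}{c|c c c}
		\toprule
		& Input resolution & Cumulative stride & Bottleneck resolution \\ \midrule
		Trachea & $128\times112\times128$ & $32\times16\times16$ & $4\times7\times8$ \\
		Left Atrium & $96\times160\times128$ & $16\times16\times16$ & $6\times10\times8$ \\
		Prostate & $40\times160\times192$ & $8\times32\times32$ & $5\times5\times6$ \\
		\bottomrule
	\end{tabular}
\end{center}
The number of pooling operations per network (five for trachea and prostate, four for left atrium) is consistent with the four or five layers described above.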


% For each dataset, report their SSN parameters and SDN parameters. In addition, {input resolution}
%We follow the rules of network design in~\cite{isensee2019automated}. Our detailed network architectures for each dataset are shown in Fig.~\ref{fig:s_net_all} and the corresponding network configurations are shown in Table.~\ref{tab:net_params}.

%We present network configurations for three datasets in detail. We utilize 3D U-Net structures following nnU-Net~\cite{isensee2019automated} design principles. Due to distinct input spacings and resolutions, we configure three groups of downsampling strides and corresponding convolutional kernel sizes, summarized in Table~\ref{tab:net_params}. Note that they share the same basic blocks such as convolution, instance normalization and ReLU. Presented in Fig~\ref{fig:s_net}, the network of trachea contains five downsampling stages, to extract rich feature context in the bottleneck. We also show other two network configurations in Fig~\ref{fig:s_net2} and Fig~\ref{fig:s_net3}.



\begin{algorithm}[t!]
	\caption{Training procedure.}
	% \hspace*{0in} {\bf Input:} The model $\mathcal{S}$; Step $t$; Batch-size $B$
	
	\textbf{Input}: Weakly-labeled training set $\mathcal{D} = \{\mathbf{I}^n, \mathbf{Y}^n\}_{n=1}^N$; \\
	\textbf{Output}: Network parameters $\Theta, \Omega$
	
	\begin{algorithmic}
		% \State \textcolor{gray}{// sample images to build mini-batch $\mathcal{N}_t$}
		\State /* Initialize SSN */
		\State Train SSN with weak labels;
		\State /* Train SDN */
		\State Compute confidence of all predicted masks by SSN on training split;
		\State Select the mask with the highest confidence as the self-taught shape representation;
		\State Augment the shape representation with designed noise and spatial transformation as input;
		\State Train SDN to reconstruct the clean shape;
		\State /* Iterative learning */
		\State Repeat:
		\State \qquad Generate pseudo labels by combining outputs of SSN and SDN with uncertainty filtering;
		\State \qquad Update SSN with generated pseudo labels and weak labels;
		\State Until reaching the maximum epoch
	\end{algorithmic}
	\label{alg:training}
\end{algorithm}

\section{Model Training}\label{appendix:training}
% General: batch size, sgd, lr, epoch, loss weights
%% bootstrap, SDN, EM
We present the training procedure in Algorithm~\ref{alg:training}.
Following nnU-Net~\cite{isensee2019automated}, we utilize the same image augmentation and deep supervision for model training, with a batch size of 2.
To train our model, we use the SGD optimizer.
For initialization, we train the SSN for 200 epochs with an initial learning rate of 1e-2, decayed to 1e-3 with the ``poly'' learning rate policy.
To train the SDN, we use a constant learning rate of 1e-2 for 100 epochs.
In iterative learning, we use slightly different parameters for different datasets.
In uncertainty filtering, for each volume, we first sort the pixels in the predicted segmentation confidence map $\mathbf{P}_s$, and then set the uncertainty threshold $\sigma_{fg}$ to filter out the least confident 70\% of all predicted foreground pixels for trachea. The ratio for left atrium and prostate is 50\%. The corresponding $\sigma_{bg}$ for the background of each dataset is set to filter out twice as many background pixels as filtered foreground pixels.
In model updating, for trachea, we set the loss weights ($\lambda_w$, $\lambda_p$) to (1, 100) and train our model for a maximum of another 300 epochs with a learning rate of 1e-3. For prostate, we set ($\lambda_w$, $\lambda_p$) to (0.1, 10) and the learning rate to 1e-2. For left atrium, we set ($\lambda_w$, $\lambda_p$) to (0.1, 10) and the learning rate to 1e-3.
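
One way to make the uncertainty filtering above concrete is the following sketch. The symbols $\hat{F}$, $\hat{B}$, and $r$ are introduced here only for illustration, and treating the thresholds as quantiles is one reading of the procedure rather than a statement of the exact implementation. Let $\hat{F}$ and $\hat{B}$ denote the sets of predicted foreground and background pixels of a volume, and let $r$ be the filtering ratio ($r=0.7$ for trachea and $r=0.5$ for left atrium and prostate). The thresholds are chosen such that
\[
\bigl|\{\, i \in \hat{F} : i \mbox{ is filtered by } \sigma_{fg} \,\}\bigr| \approx r\,|\hat{F}|, \qquad
\bigl|\{\, i \in \hat{B} : i \mbox{ is filtered by } \sigma_{bg} \,\}\bigr| \approx 2r\,|\hat{F}|,
\]
where in both cases the filtered pixels are the least confident ones after sorting. For example, for a trachea volume with $10{,}000$ predicted foreground pixels, roughly $7{,}000$ foreground pixels and $14{,}000$ background pixels would be filtered out.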

% uncertainty threshold: trachea 0.7, la 0.5, prostate 0.5
% IN E-step, uncertainty thresholds $\sigma_{fg}$ and $\sigma_{bg}$ are calculated according to 

% Sepecified: EM: lr, loss weights





%\begin{figure*}[t!]
%	\centering
%	\begin{subfigure}{1.\textwidth}
%		\centering 
%		\includegraphics[width=0.9\textwidth]{fig/s_net.png}
%		\caption{Network architecture for trachea dataset.}
%		\label{fig:s_net}
%	\end{subfigure}
%	\begin{subfigure}{1.\textwidth}
%		\centering 
%		\includegraphics[width=0.9\textwidth]{fig/s_net2.png}
%		\caption{Network architecture for left atrium dataset.}
%		\label{fig:s_net2}
%	\end{subfigure}
%	\begin{subfigure}{1.\textwidth}
%		\centering 
%		\includegraphics[width=0.9\textwidth]{fig/s_net3.png}
%		\caption{Network architecture for prostate dataset.}
%		\label{fig:s_net3}
%	\end{subfigure}
%	\caption{Network architectures. We adopt the same architecture for Semantic Segmentation Network (SSN) and Shape Denoising Network (SDN) but with independent parameters.}
%	\label{fig:s_net_all}
%\end{figure*}
%
%\begin{figure*}[t!]
%    \centering 
%    \includegraphics[width=0.9\textwidth]{fig/s_net.png}
%    \caption{The Network architecture for trachea dataset. We adopt the same architecture for semantic segmentation network (SSN) and shape denoising network (SDN), yet each with with independent parameters.}
%    \label{fig:s_net}
%\end{figure*}
%
%\begin{figure*}[t!]
%    \centering 
%    \includegraphics[width=0.9\textwidth]{fig/s_net2.png}
%    \caption{The Network architecture for left atrium dataset.}
%    \label{fig:s_net2}
%\end{figure*}
%
%\begin{figure*}[t!]
%    \centering 
%    \includegraphics[width=0.9\textwidth]{fig/s_net3.png}
%    \caption{The Network architecture for prostate dataset.}
%    \label{fig:s_net3}
%\end{figure*}




\section{Experiment on PROMISE12}\label{appendix:prostate}

\paragraph{PROMISE12 Challenge}
The PROMISE12 challenge~\cite{litjens2014evaluation} contains 50 transversal T2-weighted MR images acquired with multiple scanning protocols\footnote{https://promise12.grand-challenge.org}, with the target prostate located in the central area of each image.
%with various diseases such as benign and prostate cancers. 
All cases in this dataset are anisotropic, with spacing ranging from $2\times0.27\times0.27\,\mathrm{mm}^{3}$ to $4\times0.75\times0.75\,\mathrm{mm}^{3}$. Following the same setting as~\cite{kervadec2020bounding}, we split the 50 scans into 40 for training and 10 for validation, and report results on the validation split, as evaluation on the test split is no longer available.

Quantitative results are shown in Table~\ref{tab:final_result}. On an organ with a relatively simple shape such as the prostate, our method also consistently outperforms KernelCut and BoxPrior, with an especially large margin in the 10\% labeled-slice setting.

To compare our method with BoxPrior on the Hybrid annotation, we conducted experiments on the 100\% setting of the prostate dataset. We added a Partial Cross-Entropy (PCE) loss to BoxPrior so that it can be trained on Hybrid, denoted as BoxPrior+PCE+Hybrid; the result is 72.56\%, which is much lower than that of Ours+Hybrid (86.01\% in Table~\ref{tab:final_result}). Moreover, to investigate the contributions of the PCE loss and the original BoxPrior loss, we conducted experiments on Hybrid with the PCE loss only and with the BoxPrior loss only. With the PCE loss only, the result is 73.15\%, which is already higher than that of BoxPrior+PCE+Hybrid. With the BoxPrior loss only, the result is 59.82\%. Note that we extensively tuned the hyper-parameters of BoxPrior+PCE+Hybrid, including their $w$ parameter, the weight of each loss term, and the learning rate, based on the thick-box experiments in their code.


\begin{table*}[t!]
	\centering
	\caption{Quantitative results on the validation split of PROMISE12. All presented numbers are in Dice [\%]. ``--'' under 10\% denotes that BoxPrior failed to predict any foreground.}
	\label{tab:final_result}        
	\resizebox{0.7\textwidth}{!}{
		\begin{tabular}{c|c|c c c c}
			\toprule
			%			\hline \hline
			\multirow{2}{*}{Method} & \multirow{2}{*}{Annotation} & \multicolumn{4}{c}{Prostate (Val)} \\ %\cline{3-14}
			%\cmidrule{3-14}
			&                        & 100\% & 50\% & 30\% & 10\%\\ 
			\midrule
			nnU-Net~\cite{isensee2019automated}     & Full label        & \multicolumn{4}{c}{91.11} \\ \cmidrule{1-6}
			
			
			% Ours        & Scribble          &           & 83.88         & 82.75         & 81.63         & 86.43     & 84.91           & 85.02         & 83.72 & 85.55         & 84.44         & 84.49          & 80.59           \\ 
			
			BoxPrior\cite{kervadec2020bounding}    & Box           & 83.82           & 80.60         & 76.93 & --    \\ 
			KernelCut\cite{tang2018regularized}   & Scribble*          & 78.68           & 77.13         & 76.72 & 72.84 \\ 
			Ours & Scribble* & 85.55 & 84.09 & 83.87 & 80.59 \\
            KernelCut\cite{tang2018regularized}   & Hybrid          & 80.18           & 77.90         & 77.58 & 73.45 \\ 
            Ours        & Hybrid            & \textbf{86.01}         & \textbf{85.71}         & \textbf{85.56}          & \textbf{80.89}           \\ 
			% &               &           &           &           &           &           &           &           &           &           &           &           & \\ \hline
			\bottomrule
		\end{tabular}
	}
\end{table*}

% Table: final results
%\begin{table*}[t!]
%	\centering
%	\resizebox{\textwidth}{!}{
%		\begin{tabular}{c|c|c c c c|c c c c|c c c c}
%			\toprule
%			%			\hline \hline
%			\multirow{2}{*}{Method} & \multirow{2}{*}{Annotation} & \multicolumn{4}{c|}{Trachea}  & \multicolumn{4}{c|}{Left Atrium} & \multicolumn{4}{c}{Prostate} \\ %\cline{3-14}
%			%\cmidrule{3-14}
%			&                        & 100\% & 50\% & 30\% & 10\% & 100\% & 50\% & 30\% & 10\% & 100\% & 50\% & 30\% & 10\%\\ 
%			\midrule
%			nnU-Net~\cite{isensee2019automated}     & Full label        & \multicolumn{4}{c|}{89.04}           & \multicolumn{4}{c|}{92.61}           & \multicolumn{4}{c}{91.11} \\ \cmidrule{1-14}
%			Ours        & Hybrid            & \textbf{83.62}     & \textbf{83.45}         & \textbf{83.59}         & \textbf{83.57}         & \textbf{86.68}     & \textbf{86.64}           & \textbf{84.64}         & \textbf{83.88} & \textbf{85.95}         & \textbf{86.03}         & \textbf{86.55}          & \textbf{80.89}           \\ 
%			% Ours        & Scribble          &           & 83.88         & 82.75         & 81.63         & 86.43     & 84.91           & 85.02         & 83.72 & 85.55         & 84.44         & 84.49          & 80.59           \\ 
%			KernelCut\cite{tang2018regularized}   & Scribble*         &  83.19  & 81.11 & 82.07         & 61.92    & 78.61         & 76.32          & 73.45          & 63.09    & 76.13           & 77.13         & 76.72 & 72.84 \\ 
%			BoxPrior\cite{kervadec2020bounding}    & Box           & 78.44    &  48.08 &  0.75        & --        & 83.50         & 83.81          & 82.33         & --        & 83.82           & 80.60         & 76.93 & --    \\ 
%			% &               &           &           &           &           &           &           &           &           &           &           &           & \\ \hline
%			\bottomrule
%		\end{tabular}
%	}
%	\caption{Quantitative results on the validation splits of all three datasets. All presented numbers are in Dice [\%]. '--' under 10\% denotes that BoxPrior failed in predicting any foreground.}
%	\label{tab:final_result}        
%\end{table*}



\section{Analysis on Shape Denoising Network} \label{appendix:analysis}
In this section, we further discuss our Shape Denoising Network (SDN). The design of our SDN is based on the assumption that our Semantic Segmentation Network (SSN) trained on weak labels is able to provide initial masks that serve as a good starting point for the SDN. We empirically found that some instances in the training split ``look like'' they have better quality than others, i.e., they have clean and complete shapes. We then compute the confidence of each predicted mask as the average probability over its foreground pixels, and found that the masks with the highest confidence usually have the best shape quality. Therefore, we take the mask with the highest confidence predicted by the SSN as the correct shape to train our SDN. Our SDN is never updated thereafter, mainly because the initially selected mask already has sufficiently good shape quality, and we empirically found that updating the SDN with new shapes did not provide further improvement.
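
The confidence score described above can be written compactly as follows, where the notation ($c^n$, $\hat{F}^n$, $n^\ast$) is introduced here for illustration only. For the $n$-th training volume, let $\mathbf{P}_s^n$ be the foreground probability map predicted by the SSN and $\hat{F}^n$ the set of pixels predicted as foreground. Then
\[
c^n = \frac{1}{|\hat{F}^n|} \sum_{i \in \hat{F}^n} \mathbf{P}_s^n(i), \qquad n^\ast = \arg\max_{n} \, c^n,
\]
and the predicted mask of volume $n^\ast$ serves as the shape representation for training the SDN. How a pixel is assigned to $\hat{F}^n$ (e.g., by thresholding $\mathbf{P}_s^n$ at 0.5) is an assumption of this sketch.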

Note that we do not use any ground truth masks of the training data in model training or inference, but we can retrospectively verify whether the selected mask is of good quality by calculating its Dice with the ground truth. Taking the 30\% setting on trachea as an example, the Dice of the selected shape is 87.89\%. After the entire learning process, its Dice only improves to 90.62\%, which indicates that its shape changes little.
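
For reference, the Dice score is the standard overlap measure between a predicted mask $A$ and a ground truth mask $B$,
\[
\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|},
\]
so the selected shape gains only $90.62 - 87.89 = 2.73$ points over the course of training.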

%\add{We believe our shape denoising network approximately captures the true underlying shape. As explained above, our selected shape is of good quality and more importantly, it is augmented with spatial transformation, which can cover a reasonable range of shape variations and form the manifold of the true shape data. Additionally, if the shape denoising network had no concept of the true underlying shape, it would not be able to distinguish a wrongly attached blob (from augmentation or SSN prediction) from a part of the true shape, and might randomly remove some parts of the input mask.}

We further analyze our SDN design in Table~\ref{tab:ablation_aug}.
All of our augmentation operations contribute to improving the SDN for noise and error removal, especially dilation, which increases Dice by 3.83\%. Moreover, we use shape instances with different confidence ranks from the SSN predictions to train our SDN. A comparison of the results shows that the most confident case provides the best shape prior, while our design is also robust to the shape selection criterion. Choosing a shape prediction with relatively low confidence (case rank 30) still improves over the baseline, which shows that our noise augmentation and denoising learning can perform well even without a high-quality shape representation.
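
The incremental gains can be read directly from Table~\ref{tab:ablation_aug}:
\[
\underbrace{69.19 - 68.39}_{\mathrm{closing}} = 0.80, \qquad
\underbrace{73.02 - 69.19}_{\mathrm{dilation}} = 3.83, \qquad
\underbrace{74.80 - 73.02}_{\mathrm{extension}} = 1.78,
\]
for a total gain of $74.80 - 68.39 = 6.41$ points over the baseline. With the rank-30 shape, the result is still $72.62 - 68.39 = 4.23$ points above the baseline and only $74.80 - 72.62 = 2.18$ points below the rank-1 shape.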

From the predictions of the initially trained SSN, we observed that organs within a dataset typically have very similar shapes. The main variations can be captured by mild spatial transformations, including translation, rotation, and scaling. Therefore, we augment a single selected shape with random spatial transformations to form the shape distribution. Empirically, we found that using a single shape with augmentation is sufficient for this task. We conducted experiments in which we selected the top 1, top 1--3, or top 1--5 shapes with the highest confidence to train the SDN, and using more shapes did not provide further improvement. For volumetric segmentation with a large variety in shape, our method can potentially be extended to multi-shape learning, e.g., by resorting to standard clustering methods to obtain several templates, training an independent autoencoder for each shape, and obtaining the final results by voting based on similarity.



% Table: ablation aug
% Table: ablation EM
\begin{table}[t!]
	\centering
	\caption{Analysis of our SDN on the validation split of trachea dataset with 30\% labeled slices. We present the contribution of each augmentation operation in an incremental manner. Moreover, we also show the effect of using different cases as our shape representation for training. Case rank denotes the confidence rank of the selected shape representation.}
	\label{tab:ablation_aug}        
	\resizebox{0.65\textwidth}{!}{
		\begin{tabular}{c c c c c c}
			\toprule
			& Case rank & Closing     & Dilation     & Extension      & Dice [\%]  \\ \midrule
			Baseline & 1 & --              & --                & --                & 68.39        \\
			& 1 & $\checkmark$    & --                & --                & 69.19         \\ %\hline
			& 1 & $\checkmark$    & $\checkmark$      & --                & 73.02        \\ %\hline
			Our SDN  & 1 & $\checkmark$    & $\checkmark$      & $\checkmark$      & \textbf{74.80}             \\ \midrule
			& 2 & $\checkmark$    & $\checkmark$      & $\checkmark$      & 74.62             \\ 
			& 15 & $\checkmark$    & $\checkmark$      & $\checkmark$      & 74.05             \\ 
			& 30 & $\checkmark$    & $\checkmark$      & $\checkmark$      & 72.62             \\ 
			\midrule
			& Top 1-3 & $\checkmark$    & $\checkmark$      & $\checkmark$      & 74.08             \\
			& Top 1-5 & $\checkmark$    & $\checkmark$      & $\checkmark$      & 73.98             \\
			\bottomrule
		\end{tabular}
	}
	%\vspace{-0.3cm}
\end{table}


\section{Further Analysis}

\subsection{Qualitative Results}\label{appendix:qualitative}

\begin{figure}[t!]
	\centering 
	\includegraphics[width=0.65\textwidth]{fig/final_vis3.png}
	\caption{Qualitative results on three datasets with 30\% labeled slices. \textbf{Top to bottom}: trachea, left atrium, and prostate.}
	\label{fig:vis}
\end{figure}

We visualize some qualitative results in Fig.~\ref{fig:vis}, which show that our method produces masks with cleaner shapes than other 2D methods.


%\subsubsection*{EM Learning Strategy.}

% \begin{enumerate}
%     \item validate the effeciveness of EM strategy, our EM strategy can boost a lot.
%     \item validate the effeciveness of shape denosing network, shape prior is significant in weaklys-supervised segmentation
%     \item validate the combination of shape prior and EM strategy, improve each other in an iterative style.
%     \item EM and Shape prior works for scribble labels, too. %(也许scribble也需要列够四行，来显示ablation.)
% \end{enumerate}
%Table.\ref{tab:ablation_em} shows the quantitative results of different model settings. The first row is our method, by decomposing the shape prior from our framework, the dice performance drops \_\_\% in the second row. Moreover, replacing our iterative EM with one-step EM worsen segmentation results by \_\_\% in dice. 
%\\\textbf{To write: ablation of hybrid and scribble labels.}

% Table: ablation EM


%\subsubsection*{Shape Augmentations.}
% \begin{enumerate}
%     \item validate dilation, it can remove attached noisy object regions. (denoise)
%     \item validate marginal extension, dataset bias encoding
%     \item validate closing operation, obtain clean and accurate bifurations, avoid blurry boundaries.
%     \item validate spatial transformations, they can augment 3D shapes and improve shape model learning.
% \end{enumerate}
%We also conduct experiments to validate components of shape augmentations. Table.\ref{tab:ablation_aug} shows segmentation performances for different augmentation operations, on the trachea dataset with 30\% label ratio. Spatial transformations are preliminarily necessary for augmentating the training set, hence we apply them by default. We first present dice performance of bootstrap baseline. By adding dilation operations, we improve segmentation performance by \_\_\%, which indicates our model learns to denoise extraneous regions. Applying closing operation furtherly brings an improvement of \_\_\%. Finally, we append marginal extension that simulates the dataset bias, and boost up with \_\_\% dice. 

\subsection{Ablation on Loss Terms}\label{appendix:ablation_loss}
We also investigate the effect of our two loss terms on model training. As shown in Table~\ref{tab:ablation_loss}, our method trained with only the pseudo labels during iterative refinement already outperforms the baseline by more than 10\%. With both the weak labels and the pseudo labels, our method further improves performance by roughly 1\%--2\%.
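
Concretely, the differences computed from Table~\ref{tab:ablation_loss} for the 50\%, 30\%, and 10\% settings are
\[
\begin{array}{lccc}
 & 50\% & 30\% & 10\% \\
\mathrm{pseudo\ label\ only} - \mathrm{baseline} & 12.72 & 13.18 & 18.64 \\
\mathrm{both\ terms} - \mathrm{pseudo\ label\ only} & 1.56 & 1.61 & 2.04
\end{array}
\]
in Dice points, e.g., $81.89 - 69.17 = 12.72$ and $83.45 - 81.89 = 1.56$ for the 50\% setting.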

\begin{table}[t!]
	\centering
	\caption{Ablation study on two loss terms of our method.}
	\label{tab:ablation_loss}        
	\resizebox{0.6\textwidth}{!}{
		\begin{tabular}[t]{c c c|c c c}
			\toprule
			\multirow{2}{*}{Method} & \multirow{2}{*}{Weak label} & \multirow{2}{*}{Pseudo label}  & \multicolumn{3}{c}{Trachea (Val)} \\ %\cline{4-6}
			&                       &              & 50\% & 30\% & 10\%                 \\ \midrule
			Baseline  & $\checkmark$  & --      & 69.17 & 68.39 & 62.50 \\
			& --      & $\checkmark$  & 81.89 & 81.57 & 81.14 \\
			Ours      & $\checkmark$  & $\checkmark$      & \textbf{83.45} & \textbf{83.18} & \textbf{83.18} \\
			\bottomrule 
		\end{tabular}
	}
\end{table}



% Table: Details of shape augmentations.
% \begin{table*}[]
%     \centering
%     \caption{Details of shape augmentations.}
%     \label{tab:shape_aug}        
%     \begin{tabular}{llllll}
%     \hline \hline
%                 & dilation operation & closing operation & marginal extension & scaling & rotation \\ \hline \hline
%     \multirow{2}{*}{Prostate}    & $\checkmark$                 & --        & --       & $\checkmark$       & $\checkmark$       \\
%                                  & $n\in[2,4], iter\in[3,6]$    &                   &                    & $p=0.2, range\in[0.7, 1.4]$    & $p=0.2, range\in[-\pi, \pi] along axial axis$    \\\hline
%     \multirow{2}{*}{Trachea}     & $\checkmark$                 & $\checkmark$      & $\checkmark$      & $\checkmark$       & $\checkmark$          \\
%                                  & $n\in[1,2], iter\in[5,11]$   & $iter\in[5,11]$   & $n_{head}\in[1,9],n_{tail}\in[4,11]$& $p=0.2, range\in[0.7, 1.4]$    & $p=0.2, range\in[-\pi/6, \pi/6] along 3 axes$    \\\hline
%     \multirow{2}{*}{Left Atrium} & $\checkmark$                 & $\checkmark$      & --                & $\checkmark$       &  $\checkmark$         \\ 
%                                  & $n\in[5,15], iter\in[5,15]$  & $iter\in[5,15]$   &                    & $p=0.2, range\in[0.7, 1.4]$    & $p=0.2, range\in[-\pi/6, \pi/6] along 3 axes$    \\
%     \hline \hline
%     \end{tabular}
%     \end{table*}


% Failure cases.
\begin{figure}[t!]
	\centering 
	\includegraphics[width=0.75\textwidth]{fig/failure2.png}
	\caption{Two failure cases on prostate dataset.}
	\label{fig:failure}
\end{figure}


\subsection{Failure Case and Future Work} \label{appendix:future}
We show two failure cases in Fig.~\ref{fig:failure}, which have incomplete or over-predicted masks despite the obvious intensity similarity or boundaries in these slices. This is mainly because our model, trained on weak labels, exploits no boundary information or constraints during training. This could be improved by incorporating low-level image features such as boundaries or superpixels into our model.


%\subsection{Discussion on 3D Box Annotation}
%\add{
%We want to briefly discuss 3D box annotation, which provides a unique pattern of label information. It might be a valuable future direction for weakly supervised volumetric segmentation, with specific design for the information 3D boxes provided. But we do not compare our method, KernelCut and Boxprior on 3D boxes in this paper for several reasons: 
%Firstly, it is nontrivial to propose a guideline for 3D box annotations and hard to estimate the cost of it. To our best knowledge, 3D boxes are much harder to annotate than 2D weak labels, as annotating a 3D box requires comparison among all slices, while annotating 2D weak labels mostly requires observation on single slices. One possible solution is to use our proposed labeling scheme to obtain 3D boxes, which would require annotating 100\% slices and is obviously not the most efficient strategy for 3D boxes. 
%Secondly, 3D boxes will probably cause performance drop of methods developed on tight 2D boxes. 3D boxes usually lead to loose 2D boxes on most slices. However, looser boxes can lead to worse performance, as discussed in Table 2 of BoxPrior, where a margin of 10 pixels could cause a performance drop of 5\% for their method. In contrast, we show statistics of length of all tight bounding boxes in Table.~\ref{tab:label_edge} of our paper, where the standard deviation is usually much larger than 20 (if considering a margin of 10 on both sides). 
%Finally, our method and KernelCut are designed for weak labels that supply both foreground and background information, while 3D boxes only provide loose background information, which is even looser than the background provided by Hybrid or Scribble* discussed in our paper. We would anticipate worse performance of both our method and KernelCut on 3D boxes than on Hybrid or Scribble*, even if we apply these methods to 3D boxes in some way. 
%In summary, drawing 3D boxes has no guarantee in reducing annotation cost, yet will probably hurt performance of these three methods, comparing to existing weak annotations. Therefore, we leave the exploration on 3D box annotation to future work.
%}
