\section{Baselines, Experiments and Results}
\label{sec:results}
\resultsTwoDimensional
\resultsThreeDimensional
\threeDimensionalDataDescription

\sloppy %%% this is a trick to avoid lines sticking out

We measure the performance of \EmbedSeg  against several state-of-the-art baseline methods that have been developed for microscopy instance segmentation. % in biology.
For 2D images, we tested all methods on three publicly available datasets, namely the \emph{BBBC010} \textit{C. elegans} brightfield dataset~\cite{ljosa2012}\footnote{We used the \textit{C. elegans} infection live/dead image set version 1 provided by Fred Ausubel and available from the Broad Bioimage Benchmark Collection.}, the \emph{Usiigaci} NIH/3T3 phase-contrast dataset~\cite{tsai2019}, and the \emph{DSB} data from the Kaggle Data Science Bowl challenge of 2018~\cite{caicedo2019}\footnote{We used a subset of the image set BBBC038v1, available from the Broad Bioimage Benchmark Collection.}.
For volumetric images, we tested all methods on four new datasets (\emph{\Organoid}, \emph{\PlatynereisLive}, \emph{\MouseSkull}, and \emph{\PlatynereisFixed}), which we make publicly available with the publication of this work.
Additional details can be found in Table~\ref{tab:data3d}.

\fussy %%% this switches back to more beautiful typesetting

\miniheadline{Chosen Baseline Methods}
% - - - - - - - - - - - - - - - - - 
\textit{Cellpose}~\cite{stringer2020} is a spatial-embedding-based instance segmentation method in which the network is trained to predict a flow vector at each pixel.
The ground-truth flow field is pre-computed from the instance masks as the solution to the heat diffusion equation, assuming a heat source placed at the center of each object instance.
During inference, the learnt flows are followed, and pixels that arrive at the same location are grouped into one instance.
%
\textit{PatchPerPix}~\cite{hirsch2020} is a method that predicts a dense binary mask per pixel. These learnt local per-pixel (per-voxel) shape descriptor masks are, during inference, assembled into complete object instances. 
%
\textit{StarDist}~\cite{schmidt2018} and \textit{StarDist-3D}~\cite{weigert2020} are arguably the most widely applied methods in microscopy image analysis at present.
StarDist predicts at each pixel (voxel) the distance to the boundary (outline) of the surrounding object along a given set of directions (rays). 
A \textit{3-Class Unet}~\cite{ronneberger2015} is another widely adopted method for semantic segmentation, \ie the assignment of one of three classes (background, foreground, border) to each pixel (voxel). 
During inference, pixels (voxels) of a given class are typically clustered into instance segmentations by finding connected components.

In addition to code for training, Cellpose offers a public model that was trained on a large and diverse set of training data.
Hence, below we report not only the performance of Cellpose trained on each dataset individually, but also how well the public model performs (see Table~\ref{tab:results2d}).


\miniheadline{Data and Data Handling in 2D}
% - - - - - - - - - - - - - - - - - 
The \textit{BBBC010 dataset} consists of only 100 images of $696\times 520$ pixels each.
Like others before us, we randomly split these images into two equally sized sets, one used for training and the other for evaluating performance (testing).
We cropped $256\times 256$ patches that are centered on each ground truth object (worm) and used 15\% of all crops as the validation set.
Reported results are averages over 9 independent data-splits and training runs.
%
For the \textit{Usiigaci dataset}, we split the 50 images of size $1024\times 1022$ pixels into 45 training and 5 test images, as suggested by Tsai~\etal~\cite{tsai2019}.
We cropped $512\times 512$ patches that are centered on all ground truth objects.
%
The \textit{DSB dataset} is the largest collection of 2D images, of which we use the same subset as originally suggested in~\cite{schmidt2018}.
It contains a total of $497$ images of variable size and is pre-split into $447$ training and $50$ test images.
We train on object-centered $256\times 256$ crops.
For the DSB and Usiigaci datasets, we hold out a randomly chosen 15\% of all training images for validation prior to cropping, and again average results over 9 independent runs.

\miniheadline{Data and Data Handling in 3D}
% - - - - - - - - - - - - - - - - - 
The \textit{\Organoid dataset} is the largest collection of 3D images, consisting of $108$ volumes of $70 \times 378 \times 401$ (Z, Y, X) voxels each.
We randomly select 15 and 11 images for validation and testing, respectively. 
Training is performed on object-centered  crops of size $32 \times 200 \times 200$.
%
The \textit{\PlatynereisLive dataset} contains $9$ images ($113 \times 660 \times 700$ voxels each), of which we randomly select 2 images each for validation and testing.
Training is performed on object-centered  crops of size $32 \times 136 \times 136$.
%
The \textit{\MouseSkull dataset} contains only $2$ images of $209 \times 512 \times 512$ and $125 \times 512 \times 512$ voxels, respectively. Due to the very limited amount of available data, we test on the sub-volume $(:,:,256{:}512)$ of the second image.
Training is performed on the remaining data using object-centered crops of size $96 \times 128 \times 128$.
%
The \textit{\PlatynereisFixed dataset} also contains $2$ images of $515 \times 648 \times 648$ voxels each.
We test the performance on the sub-volume $(300{:}405, :, :)$ of the second image and train on object-centered crops of size $80 \times 80 \times 80$ taken from the remaining data.

\vspace{2mm}
\noindent For all 3D datasets, we report the average results on the test data over 3 independent runs. 
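
For orientation, and under the assumption that all images not selected for validation or testing are used for training, the random splits described above leave
\[
  108 - 15 - 11 = 82 \quad\text{and}\quad 9 - 2 - 2 = 5
\]
training volumes for the \Organoid and \PlatynereisLive datasets, respectively.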

\miniheadline{Training Details}
% - - - - - - - - - - - - - - - - - 
All results obtained with \EmbedSeg and the method by Neven~\etal on 2D datasets use the Branched ERF-Net~\cite{romera2018, neven2019} architecture and the Adam optimizer~\cite{kingma2014adam} with a decaying learning rate $\alpha_i = 5 \times 10^{-4}\left[1 - \frac{i}{200}\right]^{0.9}$, where $i$ denotes the current epoch.
For training and inference on 3D datasets, we propose a Branched ERF-Net that uses 3D convolutions (see Appendix~\ref{sec:network-architecture} for a schematic of our proposed architecture).
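
As a worked illustration of the learning-rate schedule stated above, and assuming a base rate of $5 \times 10^{-4}$ with epochs counted from $i = 0$ (the indexing convention is an assumption made here for illustration), the schedule evaluates to approximately
\[
  \alpha_{0} = 5 \times 10^{-4}, \qquad
  \alpha_{100} = 5 \times 10^{-4} \cdot 0.5^{0.9} \approx 2.7 \times 10^{-4}, \qquad
  \alpha_{199} = 5 \times 10^{-4} \cdot 0.005^{0.9} \approx 4.2 \times 10^{-6},
\]
\ie the learning rate decays slightly faster than linearly and approaches zero at the end of training.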

For the BBBC010 data, we use a batch size of $1$ without a virtual batch multiplier, while for all other datasets we employ a batch size of $2$ and a virtual batch multiplier of $8$ (giving us an effective batch size of $16$).
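
One common way to realize a virtual batch multiplier is gradient accumulation; the following is a sketch under that assumption and not a description of the exact implementation. With a mini-batch size $b$ and a multiplier $m$ (both symbols introduced here for illustration), the gradients $g_1, \dots, g_m$ of $m$ consecutive mini-batches are averaged before a single optimizer step is taken, so that
\[
  \bar{g} = \frac{1}{m} \sum_{k=1}^{m} g_k,
  \qquad
  b_{\text{eff}} = b \cdot m = 2 \cdot 8 = 16 .
\]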

During training, axis-aligned rotations and flips were used for augmenting the available data.
Each training run lasted $200$ epochs, and the model with the best performance \wrt IoU on the validation data was then used for reporting results on the test data (see Tables~\ref{tab:results2d} and~\ref{tab:results3d}).


\miniheadline{Performance Evaluations}
% - - - - - - - - - - - - - - - - - 
All results on 2D images are compared using the Mean Average Precision ($\text{AP}_{\text{dsb}}$ score~\cite{schmidt2018}) at IoU thresholds ranging from $0.5$ to $0.9$ (see Table~\ref{tab:results2d}), while the results on volumetric images are evaluated at IoU thresholds ranging from $0.1$ to $0.9$ (see Table~\ref{tab:results3d}).
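As a reminder of how this score is commonly defined (see~\cite{schmidt2018} for the exact matching procedure), predicted and ground-truth instances are matched at a given IoU threshold $\tau$, and
\[
  \text{AP}_{\text{dsb}}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FP}(\tau) + \text{FN}(\tau)},
\]
where $\text{TP}(\tau)$ counts matched predictions, $\text{FP}(\tau)$ unmatched predictions, and $\text{FN}(\tau)$ unmatched ground-truth objects.
For illustration only, with hypothetical counts $\text{TP} = 8$, $\text{FP} = 1$ and $\text{FN} = 1$ at some threshold, the score would be $8 / (8 + 1 + 1) = 0.8$.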
For all \EmbedSeg and Neven~\etal results, we compute the minimum object size, in terms of the number of interior pixels, from the available training and validation masks. We then use this value during inference to discard smaller predicted instances and thereby avoid spurious false positives.
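
In symbols, the size filter just described can be sketched as follows; the notation is introduced here for illustration, and whether the comparison is strict is an implementation detail not fixed by this sketch. Let $M_1, \dots, M_K$ denote the ground truth instance masks in the training and validation data and $|M_k|$ the number of interior pixels (voxels) of mask $M_k$. With
\[
  s_{\min} = \min_{k = 1, \dots, K} |M_k| ,
\]
a predicted instance $P$ is retained only if $|P| \geq s_{\min}$.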

\miniheadline{Ablation Studies}
% - - - - - - - - - - - - - - - - - 
\ablationResults
To evaluate the contribution of
$(i)$~using the medoid instead of the centroid in \EmbedSeg, and
$(ii)$~employing test-time augmentation,
we performed the respective ablation studies and report the results on two 2D datasets and one 3D dataset in Table~\ref{tab:ablation}.
