\section{Related Work and Proposed Method}
\label{sec:approach}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\figQualitative
Embedding-based segmentation methods have recently emerged in the context of multi-person pose estimation. 
\cite{newell2017} initially suggested a deep learning framework in which each pixel predicts a \emph{tag} or \textit{embedding}. 
The proposed objective encourages pairs of tags to have similar values if and only if the corresponding pixels belong to the same object.
In the same year, \cite{brabandere2017} suggested a specific hinge loss which led to improved clustering during inference, \ie their loss penalizes close proximity between the mean embeddings of different objects. 
\cite{novotny2018} later showed that constructing dense pixel embeddings to separate objects is not possible with a fully convolutional setup.

\EmbedSeg uses a branched ERF-Net~\cite{romera2018,neven2019}, such that each pixel $\vec{x}_i\in S_k$, in an object instance with label $k$, is trained to predict
$(i)$~an offset vector $\vec{o}_i$ that embeds $\vec{x}_i$ to $\vec{e}_i = \vec{x}_i + \vec{o}_i$, ideally coinciding with a uniquely defined embedding location $\vec{e}_{i}^{k}$ for the ground truth mask $S_k$,
$(ii)$~an uncertainty vector $\vec{\sigma}_i$ that estimates the error of $\vec{e}_i$ \wrt $\vec{e}_{i}^{k}$, and
$(iii)$~a \textit{seediness score} $s_i$ that expresses the likelihood that this pixel coincides with $\vec{e}_{i}^{k}$.
Interestingly, the loss terms that enable the training of these predicted values also ensure that the IoU of $S_k$ and the predicted instance segmentation is maximized. 
Additional details are provided in Appendix \ref{sec:extended-approach}.

Once trained, the following inference scheme is used to find object instances (see Appendix~\ref{sec:extended-approach-2} for more details):
$(i)$~we collect all pixels with a seediness score $s_i > s_{\text{fg}}$ in a set of foreground pixels $S_{\text{fg}}$,
$(ii)$~from all pixels in $S_{\text{fg}}$, we pick $\vec{x}_\text{seed}$, the pixel with the highest seediness score, provided that this score exceeds $s_{\text{min}}$,
$(iii)$~if such a seed pixel exists, we collect all foreground pixels in $S_{\text{fg}}$ that embed themselves at a location where the embedding likelihood defined by $\vec{e}_{\text{seed}}$ and $\vec{\sigma}_{\text{seed}}$ exceeds $0.5$.
Together, these pixels define a segmented instance $S_k$. Finally,
$(iv)$~we remove all pixels in $S_k$ from $S_{\text{fg}}$ and repeat from step $(ii)$ until no more valid seed pixels $\vec{x}_\text{seed}$ exist in $S_{\text{fg}}$.
In all our experiments we use $s_{\text{fg}} = 0.5$ and $s_{\text{min}}=0.9$.
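
The inference steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the actual \EmbedSeg implementation: the function name \texttt{cluster\_instances}, the flattened per-pixel array layout, and in particular the axis-aligned Gaussian form of the embedding likelihood are assumptions made here for concreteness; the precise definition is the one given in the appendix.
\begin{verbatim}
```python
import numpy as np


def cluster_instances(embeddings, sigmas, seediness, s_fg=0.5, s_min=0.9):
    """Greedy seed-based clustering of pixel embeddings (illustrative sketch).

    embeddings: (N, D) array of embedded locations e_i = x_i + o_i
    sigmas:     (N, D) array of strictly positive uncertainties sigma_i
    seediness:  (N,) array of seediness scores s_i
    Returns an (N,) integer label array, where 0 means "no instance".
    """
    labels = np.zeros(len(seediness), dtype=int)
    unassigned = seediness > s_fg  # step (i): foreground set S_fg
    next_label = 1
    while unassigned.any():
        # Step (ii): unassigned foreground pixel with the highest seediness.
        candidates = np.where(unassigned, seediness, -np.inf)
        seed = int(np.argmax(candidates))
        if candidates[seed] <= s_min:
            break  # no valid seed pixel is left
        # Step (iii): ASSUMED likelihood, an axis-aligned Gaussian centred
        # at e_seed with per-axis width sigma_seed.
        diff = (embeddings - embeddings[seed]) / sigmas[seed]
        likelihood = np.exp(-0.5 * np.sum(diff ** 2, axis=1))
        members = unassigned & (likelihood > 0.5)
        labels[members] = next_label
        next_label += 1
        # Step (iv): remove the new instance from S_fg and repeat.
        unassigned &= ~members
    return labels
```
\end{verbatim}
Since the seed pixel always has likelihood $1$ under this assumed form, every iteration removes at least one pixel from $S_{\text{fg}}$, so the loop terminates after at most $\vert S_{\text{fg}} \vert$ iterations.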

While Neven~\etal either learn the desired embedding location during training or simply use the centroid, we argue that this is not the optimal choice when object shapes are more complex (\ie not star-convex). 
We reason that it is desirable to choose a point that minimizes the average distance to all pixels $\vec{x}_i \in S_k$, \ie the \textit{geometric median}~(GM).
Like the centroid, the GM also has the unfortunate property that it can lie outside of its defining object. 
Such object-external points are bad embedding points for two reasons:
$(i)$~the seediness score of such points will likely be very low, and
$(ii)$~multiple such points might fall very close to each other in crowded image regions.
Hence, we propose to use the \textit{medoid} instead.
The medoid pixel of the object instance $S_k$ is the one pixel of the object with the smallest average distance to all other pixels, \ie $\vec{x}_\text{medoid}(S_k) = \argmin_{\vec{y} \in S_k} \frac{1}{\vert S_{k} \vert} \sum_{\vec{x} \in S_k} \lVert \vec{x} - \vec{y} \rVert_2$.
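
As a small illustration of this definition, the medoid of a binary mask can be computed by brute force. The sketch below is not taken from the \EmbedSeg code base: the function name \texttt{medoid} is hypothetical, the mask is assumed to be non-empty, and the full pairwise distance matrix needs memory quadratic in the number of object pixels, so it is only practical for small objects.
\begin{verbatim}
```python
import numpy as np


def medoid(mask):
    """Return the medoid pixel of a non-empty binary mask as an index tuple."""
    coords = np.argwhere(mask).astype(float)  # (N, D) pixel coordinates
    # Pairwise Euclidean distances between all object pixels, shape (N, N).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Pixel with the smallest average distance to all object pixels;
    # ties are broken in favour of the first pixel in scan order.
    best = int(np.argmin(dists.mean(axis=1)))
    return tuple(int(c) for c in coords[best])


# An L-shaped object: its centroid falls on a background pixel,
# whereas the medoid is an object pixel by construction.
l_shape = np.zeros((5, 5), dtype=bool)
l_shape[:, 0] = True
l_shape[4, :] = True
centroid = np.argwhere(l_shape).mean(axis=0)
medoid_pixel = medoid(l_shape)
```
\end{verbatim}
Because the $\argmin$ ranges only over $\vec{y} \in S_k$, the returned pixel always belongs to the object, which is exactly the property that the centroid and the GM lack.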
% \end{equation}

During prediction, we use $8$-fold and $16$-fold test-time augmentation in 2D and 3D, respectively~\cite{zeng2017, wang2019}: the evaluation images are transformed through axis-aligned rotations and flips, and their corresponding predictions are back-transformed and averaged.
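
The $8$-fold scheme in 2D can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: \texttt{tta\_predict\_2d} is a hypothetical name, and \texttt{predict} stands for any callable that maps a 2D array to a prediction map of the same shape. The sketch applies as written only to scalar-valued maps such as the seediness score; for vector-valued outputs such as the offsets, the vector components would additionally have to be permuted and negated when transforming back.
\begin{verbatim}
```python
import numpy as np


def tta_predict_2d(image, predict):
    """Average a scalar prediction map over 8 rotations/flips of a 2D image."""
    outputs = []
    for flip in (False, True):
        for k in range(4):  # rotations by 0, 90, 180 and 270 degrees
            augmented = np.rot90(image, k)
            if flip:
                augmented = np.flip(augmented, axis=1)
            prediction = predict(augmented)
            # Undo the augmentation in reverse order: flip first, then rotate.
            if flip:
                prediction = np.flip(prediction, axis=1)
            outputs.append(np.rot90(prediction, -k))
    return np.mean(outputs, axis=0)
```
\end{verbatim}
With an identity predictor the function returns the input unchanged, which is a convenient sanity check that every transformation is inverted correctly.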