\section{Details on Training \EmbedSeg in 2D and 3D}
\label{sec:extended-approach}

The goal of instance segmentation is to cluster a set of pixels $\vec{X}= \{ \vec{x}_{1}, \ldots, \vec{x}_{i}, \ldots, \vec{x}_{N} \}$ (where $\vec{x}_{i} \in \mathcal{R}^{D}$, with $D$ being the dimensionality of the given input images) into a set of segmented object instances $S=\{ S_{1}, \ldots, S_{k}, \ldots, S_{K} \}$.

This is achieved by learning an offset vector $\vec{o}_{i}$ for each pixel $\vec{x}_{i}$, so that the resulting (spatial) embedding $\vec{e}_{i}=\vec{x}_{i}+\vec{o}_{i}$ points to its corresponding object center (instance center) $\vec{C}_{k}$.
Here, $\vec{o}_{i}$, $\vec{e}_{i}$ and $\vec{C}_{k}$ are in $\mathcal{R}^{D}$.
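
As a minimal sketch of the relation $\vec{e}_{i}=\vec{x}_{i}+\vec{o}_{i}$, the following snippet adds a per-pixel offset to each pixel coordinate. The function name \texttt{spatial\_embeddings} and the toy coordinates are illustrative assumptions and are not taken from the \EmbedSeg code base; plain Python lists are used only to keep the example self-contained.

```python
from typing import List, Sequence

Vector = List[float]


def spatial_embeddings(pixels: Sequence[Sequence[float]],
                       offsets: Sequence[Sequence[float]]) -> List[Vector]:
    """Return e_i = x_i + o_i for every pixel (hypothetical helper)."""
    if len(pixels) != len(offsets):
        raise ValueError("need exactly one offset vector per pixel")
    return [[x + o for x, o in zip(x_i, o_i)]
            for x_i, o_i in zip(pixels, offsets)]


# Toy 2D example: three pixels whose offsets all point to the center (1, 1).
pixels = [[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]]
offsets = [[1.0, 1.0], [-1.0, 1.0], [0.0, -1.0]]
embeddings = spatial_embeddings(pixels, offsets)
```

In this toy case all three embeddings coincide, which is the ideal situation for pixels of one object.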

To do so, we propose to use a Gaussian function $\phi_{k}$ for each object $S_{k}$, which converts the distance between a (spatial) pixel embedding $\vec{e}_{i}$ and the instance center $\vec{C}_{k}$ into a probability of belonging to that object:

\begin{equation}
\phi_{k} \left( \vec{e}_{i} \right) = \exp \left( -\Big\lVert \frac{ \left( \vec{e}_{i} - \vec{C}_{k} \right)^{T} \mathlarger{\mathlarger{\vec{\Sigma}}}_{k}^{-1} \left( \vec{e}_{i} - \vec{C}_{k} \right)}{2} \Big\rVert \right).   \label{eqgauss}
\end{equation}

A high probability signifies that the pixel embedding $\vec{e}_{i}$ is close to the instance center  $\vec{C}_{k}$ and the corresponding pixel is likely to belong to the object $S_k$, while a low probability means that the pixel is more likely to belong to the background (or another object). 
More specifically, if $\phi_{k}(\vec{e}_{i})>0.5$, the pixel at location $\vec{x}_{i}$ will be assigned to the object $S_k$. 
Here, $\mathlarger{\mathlarger{\vec{\Sigma}_{k}}} \in \mathcal{R}^{D \times D}$ is the diagonal covariance matrix representing the cluster bandwidth for object $S_k$. 
The corresponding standard deviation vector for object $S_k$ is indicated as $\vec{\sigma}_{k} \in \mathcal{R}^{D}$ whose entries along the $d^{\text{th}}$ dimension are denoted as $\sigma_{k, d}$. 
For example, for $D = 3$, 

\begin{equation}
    \vec{\Sigma}_{k} = \begin{bmatrix}
    \sigma^{2}_{k, 1} & 0 & 0 \\
    0 & \sigma^{2}_{k, 2} & 0 \\
    0 & 0 & \sigma^{2}_{k, 3}
    \end{bmatrix}.
\end{equation}
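
Because $\vec{\Sigma}_{k}$ is diagonal, the quadratic form in Equation~\eqref{eqgauss} reduces to $\sum_{d=1}^{D} (e_{i,d} - C_{k,d})^{2} / \sigma_{k,d}^{2}$, and the threshold $\phi_{k}(\vec{e}_{i})>0.5$ is equivalent to this sum being smaller than $2 \ln 2 \approx 1.386$. The sketch below evaluates the membership probability under exactly this diagonal assumption. The function names \texttt{membership\_probability} and \texttt{belongs\_to\_instance} are hypothetical and chosen for illustration only.

```python
import math
from typing import Sequence


def membership_probability(e: Sequence[float],
                           center: Sequence[float],
                           sigma: Sequence[float]) -> float:
    """Gaussian phi_k(e_i) for a diagonal covariance with entries sigma_d**2."""
    if not (len(e) == len(center) == len(sigma)):
        raise ValueError("e, center and sigma must have the same length D")
    # Quadratic form (e - C)^T Sigma^{-1} (e - C) for a diagonal Sigma.
    quad = sum(((e_d - c_d) / s_d) ** 2
               for e_d, c_d, s_d in zip(e, center, sigma))
    return math.exp(-0.5 * quad)


def belongs_to_instance(e: Sequence[float],
                        center: Sequence[float],
                        sigma: Sequence[float],
                        threshold: float = 0.5) -> bool:
    """Assign the pixel to the object if phi_k(e_i) exceeds the threshold."""
    return membership_probability(e, center, sigma) > threshold


center = [1.0, 1.0, 1.0]
sigma = [1.0, 2.0, 3.0]
near = belongs_to_instance([2.0, 1.0, 1.0], center, sigma)  # quad = 1
far = belongs_to_instance([3.0, 1.0, 1.0], center, sigma)   # quad = 4
```

Here the first embedding lies one standard deviation from the center along the first axis and is accepted, while the second lies two standard deviations away and is rejected.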

To allow larger objects to predict a larger $\mathlarger{\mathlarger{\vec{\Sigma}_{k}}}$ and smaller objects to predict a smaller one, we let each pixel $\vec{x}_{i}$ of object $S_{k}$ individually predict a $\vec{\sigma}_{i}$ and compute the corresponding $\vec{\sigma}_{k}$ of the object as the mean of all $\vec{\sigma}_{i}$ predicted for that object:

\begin{equation}
    \vec{\sigma}_{k} = \frac{1}{\vert S_{k} \vert} \mathlarger{\sum}_{\vec{x}_{i} \in S_{k}} \vec{\sigma}_{i}.
\end{equation}

By comparing the predicted $\phi_{k}$ of each object to its ground truth foreground mask $S_{k}$, we compute the differentiable Lov\'asz-Softmax loss $L_{\text{IoU}}$~\cite{berman2018, yu2015}.

At inference time, the center of attraction of each object still has to be deduced, so that pixel embeddings falling within a \emph{margin} around it can be collected. For this purpose, we also let each pixel predict a \emph{seediness} score, which indicates how likely that pixel is to be the center of attraction. The seediness score should be similar to the output of the Gaussian function in Equation~\eqref{eqgauss}, so we construct the loss function 
\begin{equation}
    L_{\text{seed}}=\frac{1}{N} \sum_{i=1}^{N}  w_{\text{fg}}\mathds{1}_{\{\vec{x}_{i} \in S_{k}\}} \lVert s_{i} - \phi_{k} (\vec{e}_{i}) \rVert^{2} + w_{\text{bg}}\mathds{1}_{\{\vec{x}_{i} \not\in S_{\text{fg}}\}} \lVert s_{i} - 0 \rVert^{2},
    \label{eqseed}
\end{equation}
which minimizes the distance between the seediness score $s_{i}$ predicted at a pixel and the output of the Gaussian function for that pixel. The seediness scores of background pixels are regressed to $0$.
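
The seed loss in Equation~\eqref{eqseed} can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name \texttt{seed\_loss} is hypothetical, each foreground pixel is given the Gaussian output $\phi_{k}(\vec{e}_{i})$ of its own object as regression target, and background pixels are marked with \texttt{None}. The default weights are the values reported for the 2D experiments.

```python
from typing import Optional, Sequence


def seed_loss(seed_scores: Sequence[float],
              targets: Sequence[Optional[float]],
              w_fg: float = 10.0,
              w_bg: float = 1.0) -> float:
    """Weighted squared error between seediness scores and their targets.

    targets[i] holds phi_k(e_i) for a foreground pixel and None for a
    background pixel, whose seediness score is regressed to 0.
    """
    if len(seed_scores) != len(targets):
        raise ValueError("need exactly one target per seediness score")
    if not seed_scores:
        raise ValueError("at least one pixel is required")
    total = 0.0
    for s_i, phi_i in zip(seed_scores, targets):
        if phi_i is None:
            total += w_bg * s_i ** 2            # background term
        else:
            total += w_fg * (s_i - phi_i) ** 2  # foreground term
    return total / len(seed_scores)             # average over all N pixels


# Two foreground pixels and one background pixel.
example_loss = seed_loss([0.9, 0.5, 0.2], [1.0, 0.5, None])
```
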
Furthermore, to ensure that the value $\hat{\vec{\sigma}}_{k}$ obtained at inference by sampling pixels with a high seediness score satisfies $\hat{\vec{\sigma}}_{k} \approx \vec{\sigma}_{k}$, we include a smoothness loss

\begin{equation}
    L_{\text{var}}= \frac{1}{\vert S_{k} \vert}\sum_{\vec{x}_{i} \in S_{k}} \lVert \vec{\sigma}_{i} - \vec{\sigma}_{k} \rVert^{2}.
    \label{eqvar}
\end{equation}
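
For a single object, the mean $\vec{\sigma}_{k}$ and the smoothness loss in Equation~\eqref{eqvar} can be sketched as below. The helper names \texttt{instance\_sigma} and \texttt{variance\_loss} are assumptions made for this example, and the input is simply the list of $\vec{\sigma}_{i}$ predicted by the pixels of that object.

```python
from typing import List, Sequence


def instance_sigma(pixel_sigmas: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the per-pixel sigma vectors of one object (sigma_k)."""
    n = len(pixel_sigmas)
    if n == 0:
        raise ValueError("an object needs at least one pixel")
    dims = len(pixel_sigmas[0])
    return [sum(s[d] for s in pixel_sigmas) / n for d in range(dims)]


def variance_loss(pixel_sigmas: Sequence[Sequence[float]]) -> float:
    """Mean squared distance of each sigma_i to the object mean sigma_k."""
    sigma_k = instance_sigma(pixel_sigmas)
    squared_distances = [
        sum((s_d - m_d) ** 2 for s_d, m_d in zip(s_i, sigma_k))
        for s_i in pixel_sigmas
    ]
    return sum(squared_distances) / len(pixel_sigmas)


# Two pixels of one 2D object with different predicted sigmas.
pixel_sigmas = [[1.0, 2.0], [3.0, 4.0]]
sigma_k = instance_sigma(pixel_sigmas)
l_var = variance_loss(pixel_sigmas)
```

The loss vanishes exactly when all pixels of the object predict the same $\vec{\sigma}_{i}$, in which case any sampled pixel reproduces $\vec{\sigma}_{k}$.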

The complete loss function is then computed as the weighted sum
\begin{equation}
L = w_{\text{seed}} L_{\text{seed}} + w_{\text{IoU}} L_{\text{IoU}} + w_{\text{var}} L_{\text{var}}.
\end{equation}

We use $w_{\text{seed}} = 1$, $w_{\text{IoU}} = 1$ and $w_{\text{var}} = 10$ for all 2D and 3D experiments. 
For all 2D experiments, we additionally set $w_{\text{fg}}$ and $w_{\text{bg}}$ to $10$ and $1$, respectively.
For all 3D experiments, $w_{\text{fg}}$ was instead set to the ratio of the number of background pixels to the number of foreground pixels in the training and validation data.
More details can be found in~\cite{neven2019} and in our open source implementation at \url{https://github.com/juglab/EmbedSeg}.
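
The weighting scheme described above can be illustrated with the following sketch. The function names \texttt{total\_loss} and \texttt{foreground\_weight} are hypothetical, and the label convention (label $0$ marks background, any other label marks an object) is an assumption of this example rather than a statement about the \EmbedSeg data format.

```python
from typing import Sequence


def total_loss(l_seed: float, l_iou: float, l_var: float,
               w_seed: float = 1.0, w_iou: float = 1.0,
               w_var: float = 10.0) -> float:
    """Weighted sum of the three loss terms with the reported weights."""
    return w_seed * l_seed + w_iou * l_iou + w_var * l_var


def foreground_weight(labels: Sequence[int]) -> float:
    """Ratio of background to foreground pixels (assumes 0 = background)."""
    n_fg = sum(1 for label in labels if label != 0)
    n_bg = len(labels) - n_fg
    if n_fg == 0:
        raise ValueError("no foreground pixels found")
    return n_bg / n_fg


loss = total_loss(l_seed=0.2, l_iou=0.5, l_var=0.01)
# Flattened toy label image: six background pixels, two object pixels.
w_fg = foreground_weight([0, 0, 0, 0, 0, 0, 1, 2])
```
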



\newpage