\renewcommand{\thesection}{\Alph{section}}
\renewcommand{\thesubsection}{\Alph{subsection}}
\section*{Supplementary Material}
\subsection{Additional Details}
%%INTO APPENDIX:
\subsubsection{Training}\label{app:model_trainig}
HoVer-NeXt is trained for 200,000 steps with a batch size of 48 using the AdamW optimizer with a weight decay of 0.0001. We use a cosine-annealing learning rate schedule from 1e-4 to 1e-8.
For the encoder, we rely on ImageNet-pretrained ConvNeXt-v2 encoders from pytorch-image-models \cite{rw2019timm}. All encoders are trained with 50\% dropout, whereas the decoder arms do not use dropout.
The two arms for instance and semantic segmentation have separate, task-specific losses. For the instance arm, the center point vector predictions are trained with MSELoss and the BCB map with cross-entropy loss. The class prediction arm uses Focal Loss ($\gamma=2.0$). The class and instance arm losses are summed and weighted using a weighting parameter ($\lambda = 0.02$). Model selection is based on the best dataset-specific validation metrics instead of the lowest validation loss.
Based on recommendations from \citet{Tellez2019QuantifyingTE}, we apply HED color augmentation, hue, saturation and brightness variation, random noise and Gaussian blurring. We also include random rotation, flipping, mirroring, zoom, scale, shear, translation and elastic transformation.
Post-processing during training for the validation step is done as explained in the inference pipeline section~\ref{sec:inferencepipe} except directly on tiles.
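
As an illustration of the learning rate schedule described above, the following minimal sketch evaluates a single cosine decay from 1e-4 to 1e-8 over 200,000 steps. It assumes one cycle without warm-up or restarts and a per-step update, neither of which is stated in the text; the function name is hypothetical and not taken from the training code.

```python
import math


def cosine_annealing_lr(step, total_steps=200_000, lr_max=1e-4, lr_min=1e-8):
    """Learning rate at `step` for one cosine decay from lr_max to lr_min.

    Assumption: a single cycle without warm-up or restarts.
    """
    # Clamp the progress so that out-of-range steps stay inside the schedule.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, the midpoint and the end of training.
for step in (0, 100_000, 200_000):
    print(step, cosine_annealing_lr(step))
```

The rate starts at the maximum, passes through roughly the average of both bounds at the midpoint and ends at the minimum.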

%% PARTIALLY INTO APPENDIX:
\subsubsection{Resolving overlaps}\label{app:resolving_overlaps}
A single worker stitches the ROIs to form the final output and resolves overlaps whenever nuclei are present both in the already written output and in the ROI that is about to be written.
Each side of the ROI is checked within 512px overlap regions to resolve duplicate instances or half-instances. All instances within the outermost quarter of the already written region are kept as is, and new instances in that area are discarded. If part of an old nucleus lies in the second quarter, it is kept as well and any information from the new ROI is discarded. All other already written nuclei are deleted and replaced by the new predictions from the second quarter onwards. Instance IDs are updated based on the largest previously written instance, so the instance numbering may not be contiguous. As the tiles come from the same inference process, there are no differences in class assignments. The method is only problematic if an instance is larger than the overlap region, which is not the case in the investigated datasets and domains.
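
The ID update can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: the value 0 denotes background, and new IDs are obtained by adding the largest previously written ID as an offset. The function name is hypothetical.

```python
def offset_instance_ids(new_ids, max_written_id):
    """Map every non-background ID of a new ROI above all written IDs.

    Assumption: 0 denotes background and is never remapped.
    """
    return {i: i + max_written_id for i in new_ids if i != 0}


written_ids = {1, 2, 5}      # IDs already in the output; gaps are allowed
new_roi_ids = [0, 1, 2, 3]   # IDs local to the incoming ROI
mapping = offset_instance_ids(new_roi_ids, max(written_ids))
print(mapping)
```

Because only the largest written ID is used, gaps in the existing numbering (here 3 and 4) are never filled, which is why the final numbering need not be contiguous.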

\subsubsection{Foreground Background Estimation}\label{app:fgbg_est}
\textcolor{cr}{We estimate the foreground of whole slide images on the thumbnail of the WSI, which is available via OpenSlide in all common WSI formats. The thumbnail size depends on the WSI and scanner and ranges from $1/75$ to $1/160$ %TODO, maybe we can make this more robust by also using the lowest level?
of the full-resolution image, and the final mask is rescaled to the required image size depending on the retrieved tile magnification. The thumbnail is first converted to grayscale using the conversion matrix of OpenCV and subsequently blurred with a $5\times5$ averaging kernel to suppress high frequencies and noise. We set an intensity threshold of 240 and keep all pixels below that threshold, forming one or multiple foreground regions. Then we filter the foreground regions by removing objects that are smaller than 0.01\% of the image and finally expand all kept regions with a dilation step using a circular kernel. The size of the kernel is again determined by the image dimensions, as we use 0.01\% of the size of the longest dimension as diameter.
The filtering step ensures that we do not keep small fragments and small slide artifacts as relevant foreground, and the dilation step avoids cut corners where some lighter tissue would be missing due to the blurring step. These thresholds were chosen qualitatively by considering WSIs from multiple cohorts and verifying that the estimated foreground area was within reasonable bounds.}
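
The small-object filtering step can be illustrated with a dependency-free sketch. It is not the pipeline implementation: it operates on a binary mask given as nested lists, assumes 4-connectivity (the text does not specify the connectivity), and uses a hypothetical function name. In practice an image-processing library would perform this step.

```python
from collections import deque


def remove_small_regions(mask, min_fraction=1e-4):
    """Drop connected foreground regions smaller than min_fraction of the image.

    Assumption: regions are 4-connected. `mask` is a list of rows of booleans.
    """
    height, width = len(mask), len(mask[0])
    min_area = min_fraction * height * width
    seen = [[False] * width for _ in range(height)]
    cleaned = [[False] * width for _ in range(height)]
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first search collects one connected region.
            region, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Keep the region only if it reaches the minimum area.
            if len(region) >= min_area:
                for r, c in region:
                    cleaned[r][c] = True
    return cleaned


demo = [[False] * 10 for _ in range(10)]
for r in range(2, 6):
    for c in range(2, 6):
        demo[r][c] = True   # 16-pixel region
demo[8][8] = True           # isolated single pixel
# The toy example uses a larger fraction than 0.01% because the image is tiny.
cleaned = remove_small_regions(demo, min_fraction=0.05)
print(sum(map(sum, cleaned)))
```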

\subsubsection{On the potential negative effects of test-time augmentations}\label{app:negative_aug}
\textcolor{cr}{The applied augmentation methods were selected during the original challenge submission, with all transformation parameters chosen such that the transformed images still appear as though they could be crops from an H\&E image. However, some of these transformations, such as adding Gaussian noise or blurring the image, remove information or make it more challenging for the model to make a correct prediction. Therefore, these transformations are removed from the set of augmentation methods during inference, although during training it is useful for the model to learn to produce acceptable results even if the image is blurry. Also included during training are elastic deformation, rotation in a range from $0^{\circ}$ to $45^{\circ}$, shearing, and zooming, all of which use an interpolation method to transform the image, thereby changing image information. The same applies to the model outputs, which need to be inversely transformed, so that any differences introduced by the interpolation lead to less exact nucleus boundary predictions. Additionally, rotating by $45^{\circ}$ or shifting the image removes pixels completely, which also leads to unnecessarily worse performance. Figure~\ref{fig:app:aug_issues} illustrates these concepts.}
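
In contrast to the interpolating transformations discussed above, rotations by multiples of $90^{\circ}$ and mirroring only permute pixels and can therefore be inverted exactly. The following sketch demonstrates this on a small grid stored as nested lists; the helper names are hypothetical and the code is not part of the inference pipeline.

```python
def rot90(grid):
    """Rotate a 2D grid by 90 degrees clockwise without interpolation."""
    return [list(row) for row in zip(*grid[::-1])]


def rot90_inverse(grid):
    """Rotate a 2D grid by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def mirror(grid):
    """Flip a 2D grid horizontally; the operation is its own inverse."""
    return [row[::-1] for row in grid]


tile = [[1, 2, 3],
        [4, 5, 6]]
# Applying a transformation and its inverse recovers the tile exactly.
print(rot90_inverse(rot90(tile)) == tile)
print(mirror(mirror(tile)) == tile)
```

Every pixel value survives the round trip unchanged, so no boundary information is lost when the model output is mapped back to the original orientation.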
\newpage
\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{Augmentation_Issues.pdf}
    \caption{\textcolor{cr}{Example of a potentially problematic augmentation method: The input image is rotated by less than 90\textdegree, which means that an interpolation method needs to be used. The model then receives the transformed image and provides an output, but naturally cannot provide an output for the now invisible area. When rotating the image back to its original orientation, a large area of the original input is missed. Zooming into a detail of the rotated model output, we can observe differences where the interpolation method softens some edges and reduces differences between neighboring pixel values. While this does not necessarily have a negative impact on the final result, differences of a single pixel can already change the Hausdorff distance for this nucleus significantly.}}
    \label{fig:app:aug_issues}
\end{figure}

\subsubsection{Lizard and PanNuke description}\label{app:liz_pan}
Both datasets are available at the \href{https://warwick.ac.uk/fac/cross_fac/tia/data/}{TIA-Warwick website}.
\paragraph{Lizard}
The Lizard dataset is an H\&E-based nuclei segmentation and classification dataset for CRC and normal colon tissue with six classes: neutrophils, epithelial cells, lymphocytes, plasma cells, eosinophils, and connective tissue cells \cite{Graham2021LizardAL}. Raw H\&E images are available at 0.5mpp both as pre-cropped (with overlap) $256\times256$ tiles and as full ROIs. It has 495,179 annotated nuclei in total but is highly imbalanced, with neutrophils and eosinophils accounting for only 0.89\% and 0.68\% of all instances. Additionally, $\sim$84\% of the dataset is background. The dataset combines multiple datasets from different institutes, one of which is the GlaS subset that we use as an external test set.

\paragraph{PanNuke}
PanNuke is another H\&E nuclei panoptic segmentation dataset but with a wider focus on samples from many different cancers \cite{gamper2019pannuke,gamper2020pannuke}. Here the classes are neoplastic cells, inflammatory cells, connective tissue cells, dead cells, and non-neoplastic epithelial cells, again with considerable class imbalance as well as tissue type imbalance. PanNuke is only available as $256\times256$px crops and only at $\sim$0.25mpp.

\subsubsection{Registration and ground truth preparation for the pHH3 mitosis dataset}\label{app:regphh3}
\textcolor{cr}{All whole slide images are converted to TIFFs. Then, each H\&E--pHH3 pair is registered in its entirety by manually specifying an anchor point in an exactly matching tissue area (e.g. by selecting a nucleus that is clearly observable in both images) to remove any offset between the images. Afterwards, we use SimpleElastix to estimate rigid and non-rigid registration transforms using greyscale versions of the images down-sampled to 0.5mpp and apply these transforms to the whole slide images at full resolution. All registrations were performed on a machine with 64 cores and 512GB memory.
Thresholds for ground truth masks from the DAB channel are set individually per ROI to account for intra- and inter-WSI differences. ROIs were deliberately selected to include a large area of potential mitoses, and areas problematic for the pHH3 stain, such as necrotic areas or clear stain artifacts, were avoided. All registered images, ROIs and generated masks were verified qualitatively to ensure that accurate segmentation masks were generated.}

\subsubsection{Self-training for mitosis}\label{app:self_train}

First, we train five models on just the Lizard dataset with five cross-validation folds and run ensemble inference on the new mitosis crops. Then, a new model is trained on the combined dataset and checkpoints are saved every 50000 steps. Based on the relative per-sample change in mean panoptic quality from the first checkpoint to the best checkpoint (according to validation metrics), samples are split into easy and hard samples. Easy samples are those with a change in panoptic quality below the median and hard samples are those with a change above it.
The same model is then re-trained from scratch only on the easy samples from the mitosis dataset and used to predict the hard samples, creating the final mitosis dataset. Mitosis ground truth from the restain always provides the only ``nuclei'' of the mitosis class, and mitosis predictions are re-classified to the second most likely class.
A model trained on only the mitosis dataset is then used to predict mitoses on Lizard where new mitosis annotations are only added if there is no other label on any of the pixels yet.
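
The easy/hard split can be sketched as below. The sketch assumes that the relative change is the signed difference divided by the value at the first checkpoint and that samples exactly at the median are treated as hard; both choices, as well as the function name and the toy values, are illustrative assumptions.

```python
from statistics import median


def split_easy_hard(pq_first, pq_best):
    """Split sample IDs by their relative change in PQ between two checkpoints."""
    change = {}
    for sample_id, first in pq_first.items():
        best = pq_best[sample_id]
        # Signed relative change; the epsilon guards against division by zero.
        change[sample_id] = (best - first) / max(first, 1e-8)
    cutoff = median(change.values())
    easy = [s for s, c in change.items() if c < cutoff]
    # Assumption: samples exactly at the median count as hard.
    hard = [s for s, c in change.items() if c >= cutoff]
    return easy, hard


# Hypothetical per-sample mean PQ at the first and at the best checkpoint.
pq_first = {"a": 0.50, "b": 0.40, "c": 0.20, "d": 0.10}
pq_best = {"a": 0.52, "b": 0.44, "c": 0.30, "d": 0.25}
easy, hard = split_easy_hard(pq_first, pq_best)
print(easy, hard)
```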

\subsubsection{Additional Validation: MitEval and EosEval}\label{app:additonal_validation}
% TODO, write a proper description and put images.
\paragraph{Mitosis Evaluation}
For this dataset, we specify \textcolor{cr}{13 ROIs from nine randomly selected CRC} H\&E resection WSI to ensure that each mitosis can actually be observed on the H\&E, which is not guaranteed when using \textcolor{cr}{automatic label generation from} pHH3-based restains. pHH3 is also positive for cells in G2, and other objects occasionally pick up the antibody as well.
\textcolor{cr}{Nine ROIs are on five slides from an internal cohort (0.12mpp) and four ROIs are on four publicly available TCGA slides (0.25mpp). Annotations were made as small ellipses around the mitotic figures by three pathologists. The final dataset matches annotations by a maximum distance of 6\,$\mu$m and, similar to \citet{Aubreville2023ACM}, annotations with at least one matching additional annotation are added to the dataset; however, we do not include an additional review step. The three observers have an ICC3 of 0.860 [0.69,0.95].} Images at $\sim$0.5mpp are provided both pre-tiled and as complete ROIs in npy format.
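
One possible way to build such a consensus is sketched below. The text does not specify the matching procedure, so the greedy clustering around running centroids, the function name and the toy coordinates are assumptions. Coordinates are assumed to be ellipse centres already converted to micrometres.

```python
import math


def consensus_annotations(annotations, max_dist_um=6.0):
    """Keep locations marked by at least two different annotators.

    `annotations` is a list of (annotator_id, x_um, y_um) tuples.
    Assumption: greedy clustering around the running cluster centroid.
    """
    clusters = []  # each cluster holds its points and the set of annotators
    for rater, x, y in annotations:
        for cluster in clusters:
            n = len(cluster["points"])
            cx = sum(p[0] for p in cluster["points"]) / n
            cy = sum(p[1] for p in cluster["points"]) / n
            if math.dist((x, y), (cx, cy)) <= max_dist_um:
                cluster["points"].append((x, y))
                cluster["raters"].add(rater)
                break
        else:
            # No existing cluster is close enough, so a new one is opened.
            clusters.append({"points": [(x, y)], "raters": {rater}})
    consensus = []
    for cluster in clusters:
        if len(cluster["raters"]) >= 2:
            n = len(cluster["points"])
            consensus.append((sum(p[0] for p in cluster["points"]) / n,
                              sum(p[1] for p in cluster["points"]) / n))
    return consensus


marks = [("p1", 10.0, 10.0), ("p2", 12.0, 10.0), ("p3", 100.0, 100.0),
         ("p1", 50.0, 50.0), ("p3", 53.0, 54.0)]
agreed = consensus_annotations(marks)
print(agreed)
```

In this toy example, the mark at (100, 100) is dropped because only one annotator placed it.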

\paragraph{Eosinophil Evaluation}
Eosinophils are a comparatively easy to spot subset of immune cells discernible on H\&E, yet in the Lizard dataset, and during the CoNIC challenge, none of the models perform well in detecting them. Therefore, we created an additional eosinophil point annotation dataset with 11 ROIs of varied sizes from 8 patients to further evaluate eosinophil detection performance, in particular across different stain variations. ROI raw H\&E images (at $\sim$0.5mpp) and annotations are provided as individual npy files.

\subsubsection{Specifications for Inference Time comparison}\label{app:inf_specs}
All models run on an HPC node with 2$\times$ Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (20 cores each), 1600GB RAM and one NVIDIA A40 48GB GDDR6. We set the following parameters for CellViT, HoVer-Net and HoVer-NeXt; unspecified parameters are left at their defaults. \textcolor{cr}{In our HPC environment, we need to set OMP\_NUM\_THREADS to 16 (matching the number of workers) for the PyTorch HoVer-Net pipeline (\href{https://github.com/vqdang/hover_net}{github.com/vqdang/hover\_net}), as we otherwise could not achieve competitive speeds.} For CellViT, all TCGA-AA* images had to be run using the 20$\times$ parameter and the others with the 40$\times$ parameter. While the images indeed have different magnifications stored, their resolution is the same ($\sim$0.25mpp). \textcolor{cr}{Moreover, the CellViT pipeline requires pre-processing of the WSI, which we included in the total processing time as the other two pipelines do this on the fly.} All experiments are run on only 16 cores. \textcolor{cr}{For future evaluation, we also specify the commit and repository used for the comparison: \newline 
HoVer-Net (\href{https://github.com/vqdang/hover_net/tree/67e2ce5e3f1a64a2ece77ad1c24233653a9e0901}{67e2ce5e3f1a64a2ece77ad1c24233653a9e0901})\newline
CellViT (\href{https://github.com/TIO-IKIM/CellViT/tree/4bc42811c9841805ef0984b3ec0daf159312323a}{4bc42811c9841805ef0984b3ec0daf159312323a})
\newline HoVer-NeXt (\href{https://github.com/digitalpathologybern/hover_next_inference/tree/c5bf99fdc2d8bd5129d780c5f19ee83a4babb0d4}{c5bf99fdc2d8bd5129d780c5f19ee83a4babb0d4}).}

\begin{table}[htbp]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {app:tab:parameters_speed}%
  {}%
  {\begin{tabular}{l|l|l}
  HoVer-Net & CellViT & HoVer-NeXt\\ \hline
  --batch\_size=64 & --batch\_size \textcolor{cr}{30} & --batch\_size=64\\
  --model\_mode=fast & --gpu 0 & --tta 4 \\
  --nr\_inference\_workers=\textcolor{cr}{16} & --magnification \textcolor{cr}{20/40} & --inf\_workers 16\\
  --nr\_post\_proc\_workers=\textcolor{cr}{16} & --geojson & --pp\_workers 16\\
  & \textcolor{cr}{--enforce\_amp}& --overlap 0.9375 \\  
  \end{tabular}}
\end{table}

\newpage
\subsubsection{Inference time comparison slides (TCGA)}\label{app:tcga_speed_cases}

\begin{table}[!ht]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab:app:tcga_speed_cases}%
  {Selected sample images for inference speed comparisons. Images were selected based on the identified foreground area and limited artefacts on the slide. They represent different tissue sizes from 20mm$^2$ to 500mm$^2$ and thereby serve as examples of realistic applications from biopsies to resection blocks. All images are from the TCGA-COAD/READ cohorts and can be obtained from \url{https://portal.gdc.cancer.gov/}. Foreground estimates are computed using our internal foreground estimation pipeline. \textcolor{cr}{The internal FGBG estimation of HoVer-Net computes smaller foreground areas, in particular for the larger WSIs.}}%
  {\begin{tabular}{c|c|c|l|l}
    Case ID & Slide ID & mpp & est. fg size & \textcolor{cr}{HoVer-Net est. fg.} \\ \hline
    TCGA-AA-3977 & DX1 & 0.2325 & 20.16mm$^2$  &\textcolor{cr}{ 19.85mm$^2$ }\\
    TCGA-AA-3688 & DX1 & 0.2325 & 49.84mm$^2$  &\textcolor{cr}{ 46.74mm$^2$ }\\
    TCGA-AA-A010 & DX1 & 0.2325 & 101.09mm$^2$ &\textcolor{cr}{ 99.97mm$^2$ }\\
    TCGA-CK-4951 & DX1 & 0.2520 & 202.90mm$^2$ &\textcolor{cr}{ 137.46mm$^2$} \\
    TCGA-5M-AAT5 & DX1 & 0.2525 & 501.00mm$^2$ &\textcolor{cr}{ 421.18mm$^2$} \\
    \end{tabular}}
\end{table}
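
For orientation, a foreground area in mm$^2$ follows from a foreground pixel count and the slide resolution as sketched below. This is plain unit arithmetic, under the assumption that the pixel count refers to the resolution given by the mpp value; the function name is hypothetical and not part of either pipeline.

```python
def foreground_area_mm2(n_foreground_px, mpp):
    """Area in mm^2 covered by n_foreground_px pixels at `mpp` microns per pixel."""
    area_um2 = n_foreground_px * mpp ** 2
    return area_um2 / 1e6  # 1 mm^2 equals 1e6 square microns


# One million pixels at 0.5 mpp cover a quarter of a square millimetre.
print(foreground_area_mm2(1_000_000, 0.5))
```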

\newpage
\subsection{Additional Figures}
\subsubsection{Tile artefacts on HN-CoNIC compared to HN-Large}
\begin{figure}[!ht]
    \centering
    \includegraphics[width=\linewidth]{comp_conic_hovernext_ln.pdf}
    \caption{In the images, we can see the same region from a cancer-free lymph node with predictions from HN$_{CoNIC}$ and HN$_{Large,TTA=4}$. We note that the epithelial predictions are wrong either way; however, we highlight the reduction in tile-based processing artefacts and the overall reduction in false positive epithelium predictions. These tile-based artefacts occur mostly if the tile normalization metrics transform the tile in a way unseen during training. Most of the tiles in the training set do not really contain background, tiles with lymphocytes rarely show a germinal center, and other strong color expressions such as ink or blood are also absent. Therefore, we recommend a constant normalization for 8bit RGB images both during training and during inference.}
    \label{fig:app:ln_comp}
\end{figure}
\newpage
\subsubsection{Qualitative comparison of HoVer-Net, CellViT, and HoVer-NeXt}

\begin{figure}[!ht]
    \centering
    \includegraphics[width=\linewidth]{comp_hn_cellvit.pdf}
    \caption{Qualitative comparison of HoVer-Net, CellViT, and HoVer-NeXt. Here we only consider detections, and not segmentations. For CellViT, we observe the same square patterns as we saw with HN-CoNIC (App. Figure~\ref{fig:app:ln_comp}) and tertiary lymphoid structures completely predicted as neoplastic epithelium. HoVer-Net filters the entire normal submucosa for processing and overpredicts neoplastic cells in general. HoVer-NeXt predicts some of the normal (perhaps hyperplastic) mucosa as normal epithelium, but also falsely classifies a lot of it as neoplastic. It is the only model that classifies the lymphoid aggregates mostly correctly, yet it also misclassifies some vessels and histiocytes as neoplastic epithelium.}
    \label{fig:app:qualitativ_comp}
\end{figure}

\newpage

\subsection{Additional Tables}

\subsubsection{Augmentation parameters}\label{app:aug_params}
\begin{table}[!ht]
\floatconts
  {tab:augmentation}%
  {\textcolor{cr}{Parameters for applied augmentation methods during training and for test-time augmentations. Color augmentations are performed in part with custom functions written in PyTorch and are defined by a single scaling factor that adjusts all parameters; however, we report the individual scaled values. The HED augmentation method is adapted from \citet{Tellez2019QuantifyingTE}. Spatial augmentations rely on a custom augmentation module written entirely using PyTorch functions. All augmentation methods run on GPU.}}%
  {\begin{tabular}{c|c|c|c|c}
    Method & Train Parameters &p(Train)& Test Parameters &p(Test)\\ \hline
    Color Aug. &  & & & \\ \hline
    Color Jitter & B,C,S,H=[0.32,0.32,0.2,0.08] & 0.3 & - & - \\
    HED Aug. & $\sigma$=0.03, bias=0.05 & 0.75 & $\sigma$=0.03, bias=0.05 & 1.0 \\
    Gaussian Noise & $\sigma$=0.05 & 0.3 & - & -\\
    Gaussian Blur & size=15, $\sigma$=(0.1,2.0) & 2.0 & - & - \\ \hline
    Spatial Aug. & & & & \\ \hline
    Mirror & H(p=0.5),V(p=0.5) & 0.5 & H(p=0.5),V(p=0.5) & 0.5 \\
    Translate & Max pct.=0.05 & 0.2 & - & - \\
    Scale & Min=0.8,Max=1.2 & 0.2 & - & -\\
    Zoom & Min=0.5,Max=1.5 & 0.2 & - & -\\
    Rotate & Max deg.=179\textdegree & 0.75 & Only 90\textdegree & 0.75 \\
    Shear & Max pct.=0.1 & 0.2 & - & - \\
    Elastic & $\alpha$ = (120,120), $\sigma$=8 & 0.5 & - & - \\ \hline
  \end{tabular}}
\end{table}
\newpage
\subsubsection{Ablation Study: Sampling vs. Weighting}\label{app:ablation}
\begin{table}[!ht]
    \centering
    \begin{tabular}{c|c|l|c|c|c|c|c|c|c|c}
         LW & DS & Metric &  & & & & & \\ \hline
         & & Bal. Acc. &  & mAcc. & Neu & Epi & Lym & Pla & Eos & Con \\ \hline
         \checkmark & \checkmark & & & 0.762 & \textbf{0.647} & \textbf{0.798} & 0.857 & \textbf{0.726} & 0.737 & 0.803 \\
                  & \checkmark & & & 0.759 & 0.627 & 0.796 & \textbf{0.858} & 0.693 & \textbf{0.785} & 0.794 \\ 
         \checkmark &          & & & \textbf{0.766} & 0.641 & 0.789 & 0.855 & 0.716 & 0.772 & \textbf{0.816} \\
         &                   & & & 0.755 & 0.617 & 0.789 & 0.847 & 0.697 & 0.776 & 0.803 \\ \hline
         & & HD & & & Neu & Epi & Lym & Pla & Eos & Con \\ \hline
         \checkmark & \checkmark & & & & \textbf{2.109} & 2.856 & 1.168 & 1.205 & 2.239 & 2.032 \\
                  & \checkmark & & & & 2.250 & 2.843 & 1.165 & \textbf{1.196} & \textbf{2.149} & \textbf{2.003} \\ 
         \checkmark &          & & & & 2.283 & \textbf{2.821} & \textbf{1.161} & 1.211 & 2.255 & 2.027 \\
                  &          & & & & 2.543 & 2.980 & 1.327 & 1.416 & 2.477 & 2.137 \\ \hline
         & & F1 & bF1 & mF1 & Neu & Epi & Lym & Pla & Eos & Con \\ \hline
         \checkmark & \checkmark & & 0.844 & \textbf{0.607} & 0.293 & 0.829 & 0.765 & 0.493 & 0.544 & 0.718 \\
          & \checkmark & & 0.841 & 0.606 & \textbf{0.313} & 0.826 & 0.766 & 0.471 & \textbf{0.553} & 0.708 \\ 
          \checkmark & & & \textbf{0.846} & 0.605 & 0.254 &\textbf{0.830} & \textbf{0.767} & \textbf{0.501} & 0.551 & \textbf{0.729} \\   
         &           & & 0.836 & 0.571 & 0.196 & 0.820 & 0.749 & 0.442 & 0.506 & 0.713 \\ \hline
         & & PQ & bPQ & mPQ & Neu & Epi & Lym & Pla & Eos & Con \\ \hline
         \checkmark & \checkmark & & 0.538 & 0.453 & 0.197 & 0.608 & 0.638 & 0.404 & 0.353 & 0.517 \\
                  & \checkmark & & \textbf{0.546} & \textbf{0.454} & \textbf{0.206} & 0.606 & \textbf{0.644} & 0.386 & \textbf{0.369} & 0.516 \\ 
         \checkmark &          & & 0.543 & 0.452 & 0.161 & \textbf{0.611} & 0.642 & \textbf{0.411} & 0.359 & \textbf{0.526} \\
                  &          & & 0.518 & 0.414 & 0.119 & 0.593 & 0.607 & 0.346 & 0.312 & 0.506 \\ \hline
        & & Binary Px. & bF1 & bMCC & & & & & & \\ \hline
        \checkmark & \checkmark & & \textbf{0.821} & 0.784 & & & & & & \\
                 & \checkmark & & 0.819 & \textbf{0.786} & & & & & & \\
        \checkmark &          & & 0.821 & 0.783 & & & & & & \\
                 &          & & 0.814 & 0.776 & & & & & & \\
                 
    \end{tabular}
    \caption{\textcolor{cr}{Ablation study: Class-based loss weighting (LW) (using the same focal loss with estimated class weights from \citet{rumberger2022panoptic}) and class-distribution-based data sampling (DS) in comparison. HN$_{Large}$ with 16TTA, fixed seed, 10-run average. If two values in this table are the same, the omitted decimals are used for deciding which is best.}}
    \label{tab:ablation}
\end{table}
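
As a reading aid for the table, the reported mF1 values are numerically consistent with an unweighted mean over the six per-class scores. The sketch below checks this for the first row (LW and DS enabled); it is an observation on the reported numbers and not a statement about how the evaluation code aggregates classes.

```python
from statistics import mean

# Per-class F1 of the first row (LW and DS enabled) of the ablation table.
f1_per_class = {"Neu": 0.293, "Epi": 0.829, "Lym": 0.765,
                "Pla": 0.493, "Eos": 0.544, "Con": 0.718}

# Unweighted mean over classes, rounded to the precision used in the table.
m_f1 = round(mean(f1_per_class.values()), 3)
print(m_f1)
```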
\newpage

\subsubsection{MitEval full metrics}\label{app:tab:mit}
To evaluate this dataset, we use the same distance based matching using the centroid of the annotated ellipse.
\begin{table}[!ht]
\floatconts
  {tab:mit}%
  {Mitosis detection}%
  {\begin{tabular}{c|c|c|c}
    & Precision & Recall & F1 \\ \hline
HN$_{Large}$ & 0.564125 & 0.680488 & 0.616874 \\
HN$_{Base}$ & 0.527298 & 0.670855 & 0.590480 \\ 
HN$_{Tiny}$ & 0.545022 & 0.720167 & 0.620478 \\
  \end{tabular}}
\end{table}
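
A distance-based evaluation of this kind can be sketched as follows. The greedy nearest-neighbour assignment, the function name and the toy coordinates are assumptions made for illustration; the evaluation code may resolve competing matches differently.

```python
import math


def detection_scores(pred, gt, max_dist):
    """Greedy one-to-one matching of predicted to ground-truth centroids.

    Assumption: each ground-truth point takes its nearest unmatched prediction.
    """
    unmatched_pred = list(pred)
    tp = 0
    for g in gt:
        if not unmatched_pred:
            break
        nearest = min(unmatched_pred, key=lambda p: math.dist(p, g))
        if math.dist(nearest, g) <= max_dist:
            unmatched_pred.remove(nearest)
            tp += 1
    fp = len(pred) - tp   # predictions without a ground-truth partner
    fn = len(gt) - tp     # ground-truth points that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gt_points = [(10.0, 10.0), (40.0, 40.0)]
pred_points = [(11.0, 10.0), (80.0, 80.0), (90.0, 10.0)]
precision, recall, f1 = detection_scores(pred_points, gt_points, max_dist=6.0)
print(precision, recall, f1)
```

In this toy example one of three predictions is matched and one of two annotations is found, giving a precision of 1/3 and a recall of 1/2.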


\subsubsection{EosEval full metrics}\label{app:tab:eos}

As the eosinophil test set only consists of point annotations, we cannot compute any segmentation metrics and only report detection measures. Results are averaged (std) over 7 patients with 11 different ROIs in total.

\begin{table}[!ht]
\floatconts
  {tab:eos}%
  {Eosinophil detection:}%
  {\begin{tabular}{c|c|c|c}
         & Precision & Recall & F1 \\ \hline
         Large & 0.699 $\pm$ 0.120 & 0.695 $\pm$ 0.061 & 0.688 $\pm$ 0.055 \\
         Base & 0.696 $\pm$ 0.128 & 0.687 $\pm$ 0.073 & 0.681 $\pm$ 0.063 \\
         Tiny & 0.641 $\pm$ 0.155 & 0.700 $\pm$ 0.072 & 0.654 $\pm$ 0.083 \\
    \end{tabular}}
\end{table}

\subsubsection{PanNuke: Additional Results}\label{app:additional_pan}
All results are averaged over the three official folds without center-cropping. In practice, results are most likely better.
\begin{table}[!ht]
\floatconts
  {tab:app:pixel_metrics_pannuke}%
  {}%
  {\begin{tabular}{l|c|c}
  Binary px. metrics \\ 
    & F1 & MCC \\ \hline
HN$_{Tiny,4TTA}$ & 0.802 & 0.810 \\
HN$_{Tiny,16TTA}$ & 0.803 & 0.811 \\ 
  \end{tabular}}
\end{table}
\begin{table}[!ht]
\floatconts
  {tab:app:f1_pannuke}%
  {}%
  {\begin{tabular}{l|c|c||c|c|c|c|c}
  F1 Score \\ 
    & bF1 & mF1 & F1$_{Neo}$ &F1$_{Epi}$ & F1$_{Inf}$ & F1$_{Con}$ & F1$_{Dead}$ \\ \hline
HN$_{Tiny,4TTA}$ & 0.822 & 0.649 & 0.715 & 0.723 & 0.679 & 0.641 & 0.486 \\
HN$_{Tiny,16TTA}$ & 0.826 & 0.653 & 0.720 & 0.728 & 0.681 & 0.646 & 0.492\\ 
  \end{tabular}}
\end{table}
\begin{table}[!ht]
\floatconts
  {tab:app:hd_pannuke}%
  {}%
  {\begin{tabular}{l|c|c|c|c|c}
  Hausdorff Distance \\ 
     &Neo & Epi & Inf& Con& Dead \\ \hline
HN$_{Tiny,4TTA}$ & 5.683&5.958&3.729&5.132&3.570 \\
HN$_{Tiny,16TTA}$ & 5.622&5.918&3.717&5.090&3.551\\ 
  \end{tabular}}
\end{table}
\begin{table}[!ht]
\floatconts
  {tab:app:bacc_pannuke}%
  {}%
  {\begin{tabular}{l|c||c|c|c|c|c}
  Balanced Accuracy \\ 
    & mAcc & Neo & Epi & Inf& Con& Dead \\ \hline
HN$_{Tiny,4TTA}$ & 0.779 &0.760 & 0.852 &0.813 &0.748 & 0.723
 \\
HN$_{Tiny,16TTA}$ & 0.782 & 0.764 & 0.854 & 0.814 & 0.751 & 0.725
\\ 
  \end{tabular}}
\end{table}

\begin{table}[!ht]
\floatconts
  {tab:app:tiss_metrics}%
  {}%
  {\begin{tabular}{l|l|l||l|l|}
  Tissue Average & bPQ & & mPQ &  \\
    & HN$_{Tiny,4TTA}$ & HN$_{Tiny,16TTA}$ & HN$_{Tiny,4TTA}$ & HN$_{Tiny,16TTA}$ \\ \hline
Adrenal gland & 0.702089 & 0.703924 & 0.49439 & 0.494484 \\ 
Bile-duct & 0.665283 & 0.667678 & 0.465224 & 0.467801 \\ 
Bladder & 0.693248 & 0.696374 & 0.575255 & 0.578479 \\ 
Breast & 0.640457 & 0.643159 & 0.493503 & 0.49549 \\ 
Cervix & 0.665073 & 0.666972 & 0.474109 & 0.47509 \\ 
Colon & 0.566692 & 0.570241 & 0.425545 & 0.428342 \\ 
Esophagus & 0.644828 & 0.64745 & 0.524058 & 0.52689 \\ 
Head\&Neck & 0.641031 & 0.643037 & 0.481729 & 0.484619 \\ 
Kidney & 0.6809 & 0.683341 & 0.512658 & 0.51673 \\ 
Liver & 0.715341 & 0.716678 & 0.501918 & 0.504076 \\ 
Lung & 0.630222 & 0.634101 & 0.425785 & 0.428984 \\ 
Ovarian & 0.608334 & 0.611863 & 0.483388 & 0.485762 \\ 
Pancreatic & 0.655729 & 0.657374 & 0.45788 & 0.460296 \\ 
Prostate & 0.62628 & 0.628754 & 0.480863 & 0.480669 \\ 
Skin & 0.620624 & 0.622956 & 0.410657 & 0.414369 \\ 
Stomach & 0.694477 & 0.696453 & 0.458618 & 0.461314 \\ 
Testis & 0.678664 & 0.679845 & 0.497335 & 0.49749 \\ 
Thyroid & 0.675996 & 0.67747 & 0.420411 & 0.422295 \\ 
Uterus & 0.617216 & 0.618833 & 0.44565 & 0.446299 \\ 
  \end{tabular}}
\end{table}