\section{Experiment Setup}

\subsection{Experiment pipeline}
\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{imgs/study-flowchart.drawio.pdf}
    \caption{Overview of our study design.
        We systematically investigate combinations of different sampling strategies, sample weightings, and loss functions (including class-imbalance-aware loss functions).}
    \label{fig:overview}
\end{figure}
An overview of our experimental design is shown in Figure~\ref{fig:overview}.
We conduct our experiments on two diverse and imbalanced datasets.
We apply standard preprocessing and augmentation techniques.
Two sampling strategies are used to train a ResNet classifier~\cite{heDeepResidualLearning2016}.
We compare six loss functions: cross-entropy (CE), focal loss, soft F1, soft Matthews correlation coefficient (MCC), and the combinations of CE with MCC and of CE with F1.
For all experiments using the cross-entropy loss, we additionally compare sample weighting under uniform sampling.
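For reference, the main loss families can be sketched with their commonly used definitions. The notation ($p_{i,c}$, $y_{i,c}$, $w_c$, $\gamma$) is introduced here for illustration only, and the exact variants, averaging schemes, and hyperparameters of our implementation may differ and are specified in the code.
Let $p_{i,c}$ denote the predicted probability of class $c$ for sample $i$ in a batch of $N$ samples with $K$ classes, let $y_{i,c} \in \{0,1\}$ be the one-hot label, and let $y_i$ be the index of the true class.
\[
    \mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i} \log p_{i,y_i},
    \qquad
    \mathcal{L}_{\mathrm{focal}} = -\frac{1}{N}\sum_{i=1}^{N} \left(1 - p_{i,y_i}\right)^{\gamma} \log p_{i,y_i},
\]
where $w_c = 1$ for unweighted CE and $\gamma \geq 0$ is the focusing parameter; for $\gamma = 0$, the focal loss reduces to unweighted CE.
A soft F1 loss replaces the hard confusion-matrix counts by sums over predicted probabilities,
\[
    \mathrm{TP}_c = \sum_{i=1}^{N} y_{i,c}\, p_{i,c},
    \quad
    \mathrm{FP}_c = \sum_{i=1}^{N} (1 - y_{i,c})\, p_{i,c},
    \quad
    \mathrm{FN}_c = \sum_{i=1}^{N} y_{i,c}\, (1 - p_{i,c}),
\]
\[
    \mathcal{L}_{\mathrm{F1}} = 1 - \frac{1}{K}\sum_{c=1}^{K} \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}.
\]
Macro-averaging over classes is an assumption of this sketch. A soft MCC loss can be built analogously by additionally defining $\mathrm{TN}_c = \sum_{i} (1 - y_{i,c})(1 - p_{i,c})$ and inserting the soft counts into the usual MCC formula.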

We share our dataset configurations and the code used for our study at \url{https://github.com/daniel-scholz/address-class-imbalance}.
For further details on the implementation of our experiments, refer to Appendix~\ref{sec:training-setup}.

\subsection{Datasets}
To make our results reproducible, we use only publicly available datasets.
%as detailed below.

\paragraph{Glioma} The first dataset comprises 3D MR images (T1w -/+ contrast, T2w, FLAIR) from large public datasets of adult patients with newly diagnosed gliomas, namely UCSF-PDGM~\cite{calabreseUniversityCaliforniaSan2022}, EGD~\cite{vandervoortErasmusGliomaDatabase2021}, and TCGA~\cite{bakasAdvancingCancerGenome2017}. 
In addition to the four imaging sequences listed above, we require biomarker testing for \textit{IDH} mutation and 1p/19q status so that samples can be classified according to the 2021 WHO classification of brain tumors into (a) \textit{IDH}-wildtype glioblastoma, (b) \textit{IDH}-mutant, 1p/19q-intact astrocytoma, and (c) \textit{IDH}-mutant, 1p/19q-codeleted oligodendroglioma~\cite{wen2021WHOClassification2021}.
In total, our dataset contains preoperative MRIs of 1174 patients.
The prevalence of glioblastoma ($\sim$80\%) in comparison to oligodendroglioma ($\sim$8\%) and astrocytoma ($\sim$12\%) is striking and consistent throughout all the available datasets, mirroring the real-world distribution.
A visualization of the class distributions is shown in Appendix~\ref{sec:dataset-distribution}, Figure~\ref{fig:dataset-distribution}.
We hold the TCGA dataset out for testing and use the remaining data for training.
To additionally assess robustness, we run each experiment with four different network initializations.
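
As a worked illustration of the class imbalance reported above, consider inverse-frequency class weights $w_c = 1 / (K \pi_c)$, where $\pi_c$ is the prevalence of class $c$ and $K = 3$. This weighting is a common choice that we assume here for illustration only; it is not necessarily the scheme used in our experiments. With the rounded prevalences $\pi = (0.80, 0.12, 0.08)$ for glioblastoma, astrocytoma, and oligodendroglioma, we obtain
\[
    w_{\mathrm{GBM}} = \frac{1}{3 \cdot 0.80} \approx 0.42,
    \qquad
    w_{\mathrm{astro}} = \frac{1}{3 \cdot 0.12} \approx 2.78,
    \qquad
    w_{\mathrm{oligo}} = \frac{1}{3 \cdot 0.08} \approx 4.17.
\]
Under this assumption, an oligodendroglioma sample is weighted ten times as strongly as a glioblastoma sample, matching the prevalence ratio $0.80 / 0.08 = 10$, while the expected weight $\sum_c \pi_c w_c$ remains $1$.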

\paragraph{Glaucoma} 
The second dataset consists of 1542 individual 2D RGB fundus photographs, of which 786 are from healthy controls, 289 show early glaucoma, and 467 show advanced glaucoma~\cite{ahnDeepLearningModel2018}.
We randomly split the dataset into training ($\frac{3}{4}$) and testing ($\frac{1}{4}$) data, stratified by class: \{no, early, advanced\} glaucoma.
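
For comparison with the glioma dataset, the class prevalences implied by the counts above are
\[
    \pi_{\mathrm{no}} = \frac{786}{1542} \approx 51.0\%,
    \qquad
    \pi_{\mathrm{early}} = \frac{289}{1542} \approx 18.7\%,
    \qquad
    \pi_{\mathrm{advanced}} = \frac{467}{1542} \approx 30.3\%,
\]
so the ratio between the largest and the smallest class is $786 / 289 \approx 2.7$.
Because the split is stratified, these proportions are approximately preserved in both partitions; the exact per-class counts depend on rounding and on the random seed and are therefore not listed here.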