\section{Results}
\label{sec:results}
%We compare a macro-averaged, i.e., balanced, accuracy since we do not want to compare metrics that are among the used loss functions (MCC, F1 score).
% baseline
%We use uniform random sampling and cross-entropy loss as a deep-learning model baseline.

\subsection{Global classification results}
% for both datasets
Our main results on the test sets are shown in Table~\ref{tab:results-multi-bal-acc} (see also Appendix~\ref{sec:further-results}, Tables~\ref{tab:results-multi-f1}, \ref{tab:results-multi-AUC}, and~\ref{tab:results-multi-mcc}).
% sample weighting & oversampling
On both datasets, the baseline model (CE loss, uniform sampling) shows reasonable performance, which improves when sample weights are added or the minority classes are oversampled.
On the glioma dataset, the largest improvement in balanced accuracy over the baseline is achieved by combining the CE~+~soft F1 loss with oversampling (relative improvement of 12.7\%).
We also observe smaller standard deviations for the loss combinations compared to the single class-imbalance-aware losses, F1 and MCC, indicating better training stability for the combination.
We explore this further in an ablation study with small batch sizes (Appendix~\ref{sec:further-results}, Table~\ref{tab:ablation_low_batch_size}).
The performance on the glaucoma dataset also improves most with the CE~+~F1 loss combination and oversampling~(+10.5\%).
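
For reference, and assuming the standard definitions (the symbols below are introduced here only for illustration), balanced accuracy is the macro-averaged recall over the $C$ classes, and the relative improvement compares the score $s_{\mathrm{method}}$ of a method with the baseline score $s_{\mathrm{base}}$:
\[
    \mathrm{BA} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c},
    \qquad
    \Delta_{\mathrm{rel}} = \frac{s_{\mathrm{method}} - s_{\mathrm{base}}}{s_{\mathrm{base}}},
\]
where $\mathrm{TP}_c$ and $\mathrm{FN}_c$ denote the true positives and false negatives of class $c$.
Because every class contributes equally to $\mathrm{BA}$ regardless of its prevalence, gains on the minority classes are not masked by the majority class.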

% Accuracy and micro- and macro-averaged ROC AUC show less improvement since the classification performance for the majority class (glioblastoma) has the largest impact.
\input{tables/table_multi_bal_acc}

\subsection{Per-class analysis}

% relevance for both datasets
In addition, we perform a per-class analysis of our methods using the class-wise F1 score, which balances precision and recall (Table~\ref{tab:results-class}).
%This metric choice minimizes the influence of other classes on the evaluation since only samples of the observed class are considered.
The baseline training setup yields a classifier biased towards the majority class (glioblastoma / no glaucoma) while performing poorly on the minority classes.
% (astrocytoma, oligodendroglioma/early, advanced glaucoma).
The overall improvements in classification performance can be directly traced to improved minority class performance, since majority class performance stays constant across almost all experiments.
% glioma dataset
On the glioma dataset, the CE~+~F1 loss substantially improves classification performance on the astrocytoma minority class~(+20.3\%).
% glaucoma dataset
On the glaucoma dataset, the largest improvement with the CE~+~F1 loss and oversampling is observed for the least prevalent class, early glaucoma~(+51.3\%).
However, we also observe that using only a class-imbalance-aware loss sometimes yields classifiers that entirely ignore one class (e.g., the F1 loss on the glaucoma dataset).
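
For reference, the class-wise F1 score of class $c$ is the harmonic mean of its precision $P_c$ and recall $R_c$ (standard definition; the symbols are introduced here only for illustration):
\[
    \mathrm{F1}_c = \frac{2 P_c R_c}{P_c + R_c}
                  = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},
\]
where $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ are the true positives, false positives, and false negatives of class $c$.
A classifier that never predicts class $c$ has $\mathrm{TP}_c = \mathrm{FP}_c = 0$, so the right-hand form gives $\mathrm{F1}_c = 0$ whenever the class is present in the test set; this is how an entirely ignored class appears in the per-class results.
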
\input{tables/tables_classwise_f1}

%\include{tables/tables_classwise_f1_glioma}
%\include{tables/tables_classwise_f1_glaucoma}

\subsection{Visual feature space analysis}

To investigate the learned representations, we plot the features of the last ResNet layer for the glioma dataset.
We use t-distributed stochastic neighbor embedding (t-SNE)~\cite{vandermaatenVisualizingDataUsing2008} to project the 256-dimensional feature vectors to 2D for visualization (see Figure~\ref{fig:tsne}).
We observe that the representations of the oligodendroglioma class are often poorly clustered in this feature space, which corresponds to the inferior performance on this class observed in Table~\ref{tab:results-class}.
The CE~+~MCC loss with uniform sampling yields better-clustered oligodendroglioma features than the baseline or the MCC-only loss.
%The latter assigns the same cluster to glioblastoma and oligodendroglioma in this visualization producing distinct line-like clusters for glioblastoma and astrocytoma but fail to find good representations for the oligodendroglioma, resulting in an F1 score of zero.

\begin{figure}
    \centering
    % \includegraphics[width=\linewidth]{imgs/figure_5.png}
    \includegraphics[width=0.95\linewidth]{imgs/feature_vis_top3_multi_last.png}
    \caption{
        t-SNE visualization of the feature representations before the last layer for the best and worst models (first run), ranked by F1 score ($\uparrow$) and shown with their class-wise F1 scores.}
    \label{fig:tsne}
\end{figure}