\section{Method}
\label{sec:methods}

\subsection{Class-imbalance-aware loss functions}
To mitigate the class imbalance common in medical imaging datasets, imbalance-aware loss functions have been proposed to improve the performance on minority classes.
%, i.e., astrocytoma and oligodendroglioma.
These loss functions generally assign larger loss values to misclassified instances of these less prevalent classes.
This adaptation serves to rectify the disparity in their impact on the overall loss calculation.
We compare the focal loss with two loss functions derived from the MCC and the F1 score.
\subsubsection{Focal Loss}
Focal Loss~\cite{linFocalLossDense2017} is an often-used loss function for imbalanced deep-learning classification problems.
%, such as glioma classification.
Difficult-to-classify examples often stem from minority classes.
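These examples are often predicted with low confidence and therefore receive higher loss values, whereas the loss of confidently classified examples is down-weighted.
Hence, the deep learning model is incentivized to attend to all classes rather than being dominated by the majority class.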
%like astrocytoma and oligodendroglioma.
The loss function is given as
\begin{equation}
    \mathcal{L}_\text{focal}(p_t)=-(1-p_t)^\gamma \log(p_t)
\end{equation}
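where $p_t$ is the predicted probability of the true class $t$ and the exponent $\gamma \geq 0$ controls how strongly well-classified samples (large $p_t$) are down-weighted; for $\gamma = 0$, the focal loss reduces to the cross-entropy loss.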
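To illustrate the effect of the factor $(1-p_t)^\gamma$, consider an assumed value of $\gamma = 2$, chosen here purely for illustration, and the natural logarithm.
A well-classified sample with $p_t = 0.9$ and a poorly classified sample with $p_t = 0.1$ then yield
\begin{equation}
    \mathcal{L}_{\text{focal}}(0.9) = -(0.1)^2 \log(0.9) \approx 0.001; \quad
    \mathcal{L}_{\text{focal}}(0.1) = -(0.9)^2 \log(0.1) \approx 1.865
\end{equation}
compared with cross-entropy values of $-\log(0.9) \approx 0.105$ and $-\log(0.1) \approx 2.303$.
The loss of the confident sample is thus reduced by a factor of $100$, whereas that of the uncertain sample is reduced by a factor of only about $1.23$.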

\subsubsection{Soft F1 Loss}
The F1 score is a valuable metric for assessing classification performance since it summarizes precision and recall into a single number through the harmonic mean.
By macro-averaging the F1 score for all classes, we obtain a balanced assessment of the classifier.
We leverage this property by deriving a loss function from a differentiable version of the F1 score.
To this end, we use differentiable true positives ($TP$), false positives ($FP$), and false negatives ($FN$):
%given as:
\begin{equation}
    TP =  \sum_{i\in I} y_i        \cdot     \hat{y_i}; \quad
    FP =  \sum_{i\in I} (1- y_i )  \cdot     \hat{y_i}; \quad 
    FN =  \sum_{i\in I} y_i        \cdot     (1-\hat{y_i})
\end{equation}
where $y_i \in \{0,1\}$ is the binary label and $\hat{y_i} \in [0,1]$ is the predicted probability of the considered class for sample $i \in I$.
The precision, recall, and F1 score are defined as:
\begin{align}
\begin{split}
    \text{precision} & = \frac{TP}{TP+FP}; \quad
    \text{recall}    = \frac{TP}{TP+FN}                                                                     \\
    F1               & = \frac{2 \cdot \text{precision}\cdot\text{recall}}{ \text{precision} + \text{recall}}
\end{split}
\end{align}

We define the corresponding loss function as $\mathcal{L}_{F1} = 1 - F1_{\text{soft}}$, where $F1_{\text{soft}}$ is the macro average of the F1 score for each class using differentiable (\emph{soft}) $TP$, $FP$, and $FN$.
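As a minimal worked example with assumed values, consider a single class and four samples with labels $y = (1, 1, 0, 0)$ and predicted probabilities $\hat{y} = (0.8, 0.6, 0.3, 0.1)$.
The soft counts are
\begin{equation}
    TP = 0.8 + 0.6 = 1.4; \quad
    FP = 0.3 + 0.1 = 0.4; \quad
    FN = 0.2 + 0.4 = 0.6
\end{equation}
so that $\text{precision} = 1.4 / 1.8 \approx 0.778$, $\text{recall} = 1.4 / 2.0 = 0.7$, and $F1 = 2\,TP / (2\,TP + FP + FN) = 2.8 / 3.8 \approx 0.737$.
This class-wise soft F1 score then enters the macro average $F1_{\text{soft}}$.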
%The downside is, however, that each class-wise F1 score in the average does not consider the true negatives ($TN$).

\subsubsection{Soft MCC Loss}
Matthews correlation coefficient (MCC)~\cite{matthewsComparisonPredictedObserved1975} is a metric that summarizes all four entries of the confusion matrix, namely $TP$, $TN$, $FP$, and $FN$, in a single value in the binary classification case.
It has been argued that the MCC is superior to many other metrics, such as accuracy, F1 score, and the receiver operating characteristic (ROC) area under the curve (AUC)~\cite{chiccoAdvantagesMatthewsCorrelation2020,chiccoMatthewsCorrelationCoefficient2021,chiccoMatthewsCorrelationCoefficient2021a,chiccoMatthewsCorrelationCoefficient2023}, because of the normalization term accounting for class imbalance.
We define the \emph{soft} $TP$, $FP$, and $FN$ as above and additionally calculate the soft true negatives ($TN$):
\begin{equation}
    TN =  \sum_{i\in I} (1-y_i)        \cdot     (1-\hat{y_i})
\end{equation}
From these definitions, the soft MCC is computed as:
\begin{equation}
    MCC_{\text{soft}} =  \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{equation}
The loss is then given by $\mathcal{L}_{MCC} = 1 - MCC_{\text{soft}}$~\cite{abhishek2021matthews}.
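For illustration, consider assumed values for four samples with labels $y = (1, 1, 0, 0)$ and predicted probabilities $\hat{y} = (0.8, 0.6, 0.3, 0.1)$, with the soft true negatives computed as $\sum_{i\in I} (1-y_i)(1-\hat{y_i})$.
This gives $TP = 1.4$, $FP = 0.4$, $FN = 0.6$, and $TN = 0.7 + 0.9 = 1.6$, and hence
\begin{equation}
    MCC_{\text{soft}} = \frac{1.4 \cdot 1.6 - 0.4 \cdot 0.6}{\sqrt{1.8 \cdot 2.0 \cdot 2.0 \cdot 2.2}} = \frac{2.0}{\sqrt{15.84}} \approx 0.503
\end{equation}
which corresponds to a loss value of approximately $0.497$.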

%\begin{equation}
%    MCC_{soft}=\frac{cs-\vec{t} \cdot \vec{p}}{\sqrt{s^2-\vec{p} \cdot \vec{p}} \sqrt{s^2-\vec{t} \cdot \vec{t}}}
%An extension to the multiclass setting~\cite{gorodkinComparingTwoKcategory2004} has been derived as
%\begin{equation}
%    \mathrm{MCC}=\frac{cs-\vec{t} \cdot \vec{p}}{\sqrt{s^2-\vec{p} \cdot \vec{p}} \sqrt{s^2-\vec{t} \cdot \vec{t}}}
%\end{equation}
%with the following definitions:
%$t_k$ is the number of samples of class $k$,
%$p_k$ is the number of samples predicted as $k$,
%$c$ is the number of correct predictions and
%$s$ is the number of samples.
%The MCC cannot be directly used as a loss function since it is not differentiable due to the non-differentiable thresholding operation needed to determine variables $c$ and $p_k$.
%Hence, we implement a differentiable version where we simply define $c$ as the sum of all class probabilities predicted for the correct class and $p_k$ as the sum of the predicted probabilities for class $k$.
%The new loss formulation is given as: $\mathcal{L}_\mathrm{MCC} = 1 - \mathrm{MCC}_\mathrm{soft}$, where $\mathrm{MCC}_\mathrm{soft}$ is the MCC variant that uses the differentiable definitions, similar to \cite{fooMulticlassClassificationBreast2022}.

\subsubsection{Combined loss functions}
The established cross-entropy loss has desirable theoretical properties.
The imbalance-aware losses presented might focus too heavily on the minority class, leading to an unwanted decrease in performance for the majority class.
Hence, we also evaluate sums of the cross-entropy loss with the F1 loss and with the MCC loss, using equal weights for the loss terms.
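Written out under the assumption that each imbalance-aware loss is paired with the cross-entropy loss $\mathcal{L}_{CE}$ separately, and with $\lambda$ denoting a common weight introduced here only for illustration (its value is not specified above), the combinations can be sketched as
\begin{equation}
    \mathcal{L}_{CE+F1} = \lambda \left(\mathcal{L}_{CE} + \mathcal{L}_{F1}\right); \quad
    \mathcal{L}_{CE+MCC} = \lambda \left(\mathcal{L}_{CE} + \mathcal{L}_{MCC}\right)
\end{equation}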

\subsection{Addressing class imbalance with sampling and weighting}
When the data are balanced, drawing the samples of a batch uniformly, i.e., each with the same probability, yields an approximately uniform class distribution within the batch.
%However, this can lead to bias towards the majority class, given a dataset with a skewed class distribution, due to the majority class having a larger probability of being drawn and thus influencing the weight updates. 
\paragraph{Oversampling}
One measure to combat class imbalance in deep learning is oversampling the minority class(es) to obtain a more uniform distribution than the original distribution in the dataset.
We implement a stratified oversampling technique, i.e., we allocate an equal portion of each batch to every class, reflecting the equal clinical relevance of each type.
%This allocation is achieved by determining the portion of a batch for each class in advance: $\lfloor\frac{\text{batch size}}{\text{number of classes}}\rfloor$.
%Then, the corresponding number of samples is randomly drawn from each class.
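As a sketch of one possible allocation, assume that each class receives
\begin{equation}
    n_{\text{per class}} = \left\lfloor \frac{B}{K} \right\rfloor
\end{equation}
slots in a batch, where $B$ denotes the batch size and $K$ the number of classes (both symbols are introduced here for illustration).
For assumed values of $B = 32$ and $K = 3$, each class would contribute $10$ randomly drawn samples, filling $30$ of the $32$ slots; how the remaining slots are handled is left open in this sketch.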

\paragraph{Sample weighting} 
Instead of oversampling the minority classes, we can assign higher importance to minority class samples by scaling the influence of a sample according to the prevalence of its class $c\in C$ in the loss calculation:
\begin{equation}
    L = \frac{1}{n} \sum_i w_c^{(i)} L^{(i)}
\end{equation}
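where $w_c^{(i)}$ is the weight of the class $c$ to which sample $i$ belongs, $L^{(i)}$ is the unweighted loss of sample $i$, and $n$ is the number of samples over which the loss is averaged.
We choose a normalized inverse frequency~\cite{sparck1972statistical} scaling for each sample.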
%, given as 
%\begin{equation}
%     w_c^{(i)} = \frac{1}{n_c} \cdot \frac{N}{C}
%\end{equation}
%where $N$ is the number of samples in the dataset and $n_c$ the number of samples of class $c\in C$. 
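One common form of normalized inverse-frequency weighting, assumed here for illustration, is
\begin{equation}
    w_c = \frac{N}{|C| \cdot n_c}
\end{equation}
where $N$ is the number of samples in the dataset and $n_c$ is the number of samples of class $c$.
For an assumed dataset with $N = 100$, two classes, $n_1 = 90$, and $n_2 = 10$, this gives $w_1 = 100 / 180 \approx 0.556$ and $w_2 = 100 / 20 = 5$.
With this normalization, $\sum_{c} n_c w_c = N$, so the average weight over the dataset equals one and the overall scale of the loss is preserved.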

