\section{Discussion} \label{sec:discussion}
The conducted experiments show that classical overfitting does not occur. Data augmentation and dropout lead to better generalization. Omitting these methods results in weaker classification performance but not in strong overfitting. This holds for all ECG-DualNet variants, even for the largest, ECG-DualNet++ $130\si{\mega}$, which has significantly more parameters than there are data samples in the 2017 PhysioNet dataset. This observation may be explained by the double descent phenomenon \cite{Belkin2019, Nakkiran2020}. A large-scale study with more network sizes may help to understand this behavior.\\
\indent In contrast to current findings in computer vision \cite{Zeiler2014, Girshick2014, He2019} and natural language processing \cite{Devlin2018, Brown2020}, large-scale pre-training on the Icentia$11$k dataset does not lead to better classification results after fine-tuning on the target 2017 PhysioNet dataset. A large difference between the dataset domains is presumably one reason for this behavior. However, we cannot provide a full explanation.\\
\indent In the conducted ablation studies, we showed that all core components of the proposed pipeline are important for achieving competitive classification performance. The component that contributes most to strong classification performance is the spectrum encoder. This indicates that the spectrum is a very suitable input representation for training a deep network.
% \indent We also observe the phenomenon that an increase in the model size with and without data augmentation does not lead to overfitting but more generalization. We have no clear explanation for this behavior. A potential description could be the double decent phenomenon \cite{Belkin2019, Nakkiran2020}, without an overfitting bump when regularization (data augmentation \& dropout) is applied. A large-scale study with more network sizes may help to understand the observed phenomenon.\\
% \indent Surprisingly, performing pre-training on the Icentia11k does not lead to better classification performance on the target dataset. This could be due to the fact that the dataset used for pre-training is too out-of-domain transferring knowledge to the target dataset.\\
% \indent We also observe from the results in Table \ref{tab:results} that ECG-DualNet++ XL and 130M perform similarly to ECG-DualNet XL. For smaller models, the ECG-DualNet surpasses the classification accuracy of ECG-DualNet++. However, on the Icentia$11$k dataset ECG-DualNet++ outperforms  ECG-DualNet in the F1 score. This may indicate that attention-based approaches need more data to surpass the performance of traditional CNNs and LSTMs for ECG classification.
% As observed in the ablations (Sec. \ref{subsec:ablations}) training without data augmentation leads to a zero valued loss. Experimental results without data augmentation are presented in Table \ref{tab:ablations}. Utilizing the proposed augmentation pipeline improves generalization. Unfortionalty, the use of a sophisticated augmentation pipeline comes with additional hyperparameters. We set the hyperparameters of our augmentation pipeline empirically based on a few test training runs. To use the full potential of the proposed augmentation pipeline hyperparameter optimization \cite{Goodfellow2016, Cubuk2019} or the development of an adaptive approach, such as \cite{Fawzi2016, Karras2020}, may lead to improvements.\\