\section{Method} \label{sec:method}

In this section, we introduce our ECG-DualNet (Fig. \ref{fig:ecg_dual_net}). We take inspiration from the model by Mousavi \etal, which uses two separate network paths to encode both time- and frequency-domain data \cite{Mousavi2019}. Our data augmentation pipeline builds on the work of Nonaka \etal, who presented multiple augmentations for ECG data \cite{Nonaka2020}.


\begin{figure}[!ht]
    \centering
    \scalebox{0.5}{\input{artwork/ecg_dual_net}}
    \caption{ECG-DualNet++ architecture with the spectrogram and the ECG signal as inputs. The ECG signal sequence is encoded by a Transformer encoder into a single latent vector. The spectrogram is encoded by multiple 2D blocks. The first two blocks are standard ResNet-like blocks and the following three blocks are Axial-Attention blocks. All blocks of the spectrogram encoder use the latent vector via conditional batch normalization. The Transformer encoder architecture \cite{Vaswani2017} is shown in the top right.}
    \label{fig:ecg_dual_net}
\end{figure}

\subsection{ECG-DualNet Architecture} \label{subsec:ecg_dualnet}

Our ECG-DualNet (Fig. \ref{fig:ecg_dual_net}) utilizes two separate encoders. The first encoder, which we refer to as the signal encoder, takes the ECG signal as input. The second encoder, which we refer to as the spectrogram encoder, is fed with a spectrogram. The final part of the network is a simple softmax classifier that produces the final classification. We use two network settings, ECG-DualNet and ECG-DualNet++, which use different building blocks for both encoders. \\
\indent In the standard setting (ECG-DualNet), the signal encoder consists of a standard long short-term memory (LSTM) \cite{Hochreiter1997} module. The LSTM module encodes temporally agnostic features into a latent vector. This latent vector is incorporated into the spectrogram encoder and the final linear layer. In ECG-DualNet++, we replace the LSTM module with a Transformer encoder \cite{Vaswani2017, Dosovitskiy2020} (Fig. \ref{fig:ecg_dual_net}, top right), using learnable encodings \cite{Vaswani2017, Reich2020b}, Layer Normalization \cite{Ba2016}, and Gaussian Error Linear Units \cite{Hendrycks2016}. To counter overfitting, dropout \cite{Srivastava2014} is used in both the LSTM and the Transformer. \\
\indent In the standard setting, the spectrogram encoder comprises five ResNet-like \cite{He2016} blocks. Each block consists of two 2D convolutions, two Pad\'{e} Activation Units \cite{Molina2020}, two Conditional Batch Normalization (CBN) layers \cite{De2017}, an average pooling layer, and a skip connection. CBN is used to condition the spectrogram encoder on the latent vector of the signal encoder. For ECG-DualNet++, the last three blocks of the spectrogram encoder are replaced with Axial-Attention blocks \cite{Wang2020}, which still employ CBN and Pad\'{e} Activation Units. As in the signal encoder, dropout \cite{Srivastava2014} is applied in each spectrogram encoder block. \\
\indent To investigate the effect of the network size, we employ different network sizes for both ECG-DualNet and ECG-DualNet++. We vary the width and the depth of the signal encoder. For the spectrogram encoder, we vary the width.
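
To illustrate how CBN can condition the spectrogram encoder on the latent vector of the signal encoder, the following NumPy sketch normalizes a feature map per channel and then applies a scale and a shift that are predicted from the latent vector. It is a minimal sketch under stated assumptions, not the PyTorch implementation used for ECG-DualNet: the function name, the linear maps \texttt{w\_gamma} and \texttt{w\_beta}, and all tensor sizes are hypothetical and chosen only for illustration.

\begin{verbatim}
```python
import numpy as np


def conditional_batch_norm(x, z, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """Normalise x per channel, then scale and shift it as a function of z.

    x: feature maps, shape (batch, channels, height, width)
    z: latent vectors of the signal encoder, shape (batch, latent_dim)
    w_gamma, w_beta: hypothetical linear maps, shape (latent_dim, channels)
    b_gamma, b_beta: biases, shape (channels,)
    """
    # Batch statistics per channel (training-mode behaviour only).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Predict one scale and one shift per sample and channel from z.
    gamma = z @ w_gamma + b_gamma  # shape (batch, channels)
    beta = z @ w_beta + b_beta     # shape (batch, channels)
    return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 16, 16))  # illustrative spectrogram features
z = rng.normal(size=(4, 32))         # illustrative latent vectors
# Zero weights with a unit scale bias reduce CBN to plain batch normalisation.
out = conditional_batch_norm(
    x, z,
    w_gamma=np.zeros((32, 8)), b_gamma=np.ones(8),
    w_beta=np.zeros((32, 8)), b_beta=np.zeros(8),
)
```
\end{verbatim}

With non-zero maps, every sample receives its own per-channel scale and shift, which is how the signal encoder can modulate the spectrogram features.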

\subsection{Training Approach} \label{subsec:training}

We train all networks on a weighted version of the cross-entropy loss \cite{Goodfellow2016}

\begin{equation} \label{eq:loss}
    \mathcal{L} = -\frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{4}\alpha_{i}\,y_{ji}\,\log(\hat{y}_{ji}),
\end{equation}
where $\mathbf{y}_{j}\in\mathbb{R}^4$ is the ground truth one-hot label, $\hat{\mathbf{y}}_{j}\in\mathbb{R}^4$ the network softmax prediction, and $\mathbf{\alpha}\in\mathbb{R}^4$ the class weighting. The cross-entropy loss is averaged over a mini-batch of size $N$. The loss function (Eq. \ref{eq:loss}) is minimized using the RAdam optimizer \cite{Liu2020}.
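
The weighted loss of Eq. \ref{eq:loss} can be sketched in plain Python as follows. This is an illustrative re-implementation of the formula as written, not the training code. The function name and the toy inputs are assumptions, and the class weights are the values reported in Section \ref{subsec:implementation}.

\begin{verbatim}
```python
import math


def weighted_cross_entropy(y_true, y_pred, alpha):
    """Weighted cross-entropy, averaged over the N samples of a mini-batch.

    y_true: list of one-hot label vectors
    y_pred: list of softmax probability vectors
    alpha:  list with one weight per class
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        for a_i, y_i, p_i in zip(alpha, y, p):
            if y_i:  # only the ground-truth class contributes
                total -= a_i * y_i * math.log(p_i)
    return total / len(y_true)


alpha = [0.4, 0.7, 0.9, 0.9]
y_true = [[1, 0, 0, 0], [0, 0, 1, 0]]           # toy labels
y_pred = [[0.25, 0.25, 0.25, 0.25]] * 2         # toy uniform predictions
loss = weighted_cross_entropy(y_true, y_pred, alpha)
```
\end{verbatim}

The sketch divides by $N$, exactly as in Eq. \ref{eq:loss}. Framework losses may normalize differently: PyTorch's \texttt{CrossEntropyLoss} with class weights and mean reduction, for example, divides by the sum of the weights of the target classes.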

\subsection{Validation Approach} \label{subsec:validation}

To validate the performance of our networks we utilize the accuracy and the F1 score. The accuracy is computed over all classes by

\begin{equation}\label{eq:acc}
    \operatorname{ACC}=\frac{1}{n}\sum_{j=1}^{n}\delta\left(\arg\max(\mathbf{y}_{j}), \arg\max(\hat{\mathbf{y}}_{j})\right),
\end{equation}
where $\delta(\cdot, \cdot)$ is the Kronecker delta, $\arg\max(\cdot)$ returns the index of the maximum value of the input vector, and $n$ is the dataset size. The F1 score is computed as

\begin{equation}\label{eq:f1}
    \operatorname{F1}=\frac{1}{4}\sum_{i=1}^{4}\frac{2\text{TP}_{i}}{2\text{TP}_{i} + \text{FP}_{i} + \text{FN}_{i}}.
\end{equation}

Here, $\text{TP}_{i}$ denotes the number of true positive predictions for class $i$ over the whole dataset, $\text{FP}_{i}$ the number of false positive predictions, and $\text{FN}_{i}$ the number of false negative predictions.
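
Both metrics (Eq. \ref{eq:acc} and Eq. \ref{eq:f1}) can be sketched in plain Python, operating on class indices, i.e. on the $\arg\max$ of the one-hot labels and of the predictions. The function names, the toy data, and the handling of a class without any labels or predictions are assumptions made for this illustration.

\begin{verbatim}
```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted class index equals the label."""
    correct = sum(1 for t, p in zip(labels, predictions) if t == p)
    return correct / len(labels)


def macro_f1(labels, predictions, num_classes=4):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(labels, predictions) if t == c and p == c)
        fp = sum(1 for t, p in zip(labels, predictions) if t != c and p == c)
        fn = sum(1 for t, p in zip(labels, predictions) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no labels and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


labels = [0, 0, 1, 2, 3, 3]        # toy ground-truth class indices
predictions = [0, 1, 1, 2, 3, 0]   # toy predicted class indices
acc = accuracy(labels, predictions)
f1 = macro_f1(labels, predictions)
```
\end{verbatim}

For this toy example the per-class F1 scores are $1/2$, $2/3$, $1$, and $2/3$, so the averaged F1 score is $17/24$, while the accuracy is $4/6$.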

\subsection{Preprocessing} \label{subsec:preprocessing}

We use a simple preprocessing pipeline composed of four steps. In the first step, the ECG signal is standardized to zero mean and unit variance. In the second step, various augmentations (Sec. \ref{subsec:data_augmentation}) are applied to the ECG signal. In the third step, a log spectrogram of the ECG signal is computed with a window length of 64, a hop size of 32, and 64 bins. Prior work showed that using the logarithmic spectrogram improves the classification accuracy of CNNs \cite{Zihlmann2017}. Finally, both the ECG signal and the spectrogram are zero-padded to a fixed length.
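
The standardization, log spectrogram, and zero-padding steps can be sketched with NumPy as follows (the augmentation step is omitted here). This is a rough sketch, not the Torchaudio-based pipeline: the Hann window, the offset \texttt{eps} inside the logarithm, the padded lengths, and all function names are assumptions. A one-sided FFT of length 64 yields 33 frequency bins, and the sketch does not attempt to reproduce the 64-bin configuration stated above.

\begin{verbatim}
```python
import numpy as np


def standardize(signal, eps=1e-8):
    """Shift the signal to zero mean and scale it to unit variance."""
    return (signal - signal.mean()) / (signal.std() + eps)


def log_spectrogram(signal, win_length=64, hop=32, eps=1e-6):
    """Log-magnitude STFT with shape (frames, frequency bins)."""
    window = np.hanning(win_length)
    starts = range(0, len(signal) - win_length + 1, hop)
    frames = np.stack([signal[s:s + win_length] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)


def zero_pad(array, length):
    """Pad (or crop) the first axis to a fixed length with trailing zeros."""
    padded = np.zeros((length,) + array.shape[1:], dtype=array.dtype)
    n = min(length, array.shape[0])
    padded[:n] = array[:n]
    return padded


rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, scale=2.0, size=3000)  # stand-in for an ECG record
ecg = standardize(raw)
spec = log_spectrogram(ecg)
ecg_padded = zero_pad(ecg, 4096)   # hypothetical fixed signal length
spec_padded = zero_pad(spec, 128)  # hypothetical fixed number of frames
```
\end{verbatim}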

\subsection{Data Augmentation} \label{subsec:data_augmentation}

Our augmentation pipeline randomly applies multiple data augmentations to the ECG signal. This improves the generalization of the trained network and helps prevent overfitting \cite{Perez2017, Nonaka2020, Hatamian2020}. The following augmentations are used: dropping, cut-out, resampling, random resampling, scaling, shifting, sine addition, and bandpass filtering. The dropping augmentation sets random samples of the ECG signal to zero, whereas the cut-out augmentation sets a random sequence to zero. In the resampling augmentation, the signal is resampled to a different heart rate. Random resampling is inspired by the random elastic deformation \cite{Simard2003, Ronneberger2015, Reich2020a} used for image augmentation: the ECG signal is resampled with a smooth random offset, resulting in a varying heart rate. In the scaling augmentation, the signal is scaled by a random factor. The shift augmentation shifts the ECG signal by a random length. The sine addition augmentation adds a sinusoidal signal with a random magnitude and phase to the ECG signal. Finally, in the bandpass filter augmentation, the ECG signal is filtered by a bandpass filter.
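
Four of these augmentations (dropping, cut-out, scaling, and sine addition) can be sketched with NumPy as follows, together with a wrapper that applies each augmentation independently with a fixed probability. The parameter ranges, the sine frequency, and all function names are assumptions for illustration and are not taken from the ECG-DualNet implementation.

\begin{verbatim}
```python
import numpy as np


def drop(signal, rng, rate=0.1):
    """Dropping: set randomly chosen individual samples to zero."""
    out = signal.copy()
    out[rng.random(len(signal)) < rate] = 0.0
    return out


def cut_out(signal, rng, max_fraction=0.2):
    """Cut-out: set one random contiguous segment to zero."""
    out = signal.copy()
    length = int(rng.integers(1, int(max_fraction * len(signal)) + 1))
    start = int(rng.integers(0, len(signal) - length + 1))
    out[start:start + length] = 0.0
    return out


def scale(signal, rng, low=0.5, high=2.0):
    """Scaling: multiply the whole signal by one random factor."""
    return signal * rng.uniform(low, high)


def add_sine(signal, rng, max_magnitude=0.5, frequency=0.01):
    """Sine addition: add a sinusoid with random magnitude and phase."""
    t = np.arange(len(signal))
    magnitude = rng.uniform(0.0, max_magnitude)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    return signal + magnitude * np.sin(2.0 * np.pi * frequency * t + phase)


def augment(signal, rng, p=0.2):
    """Apply each augmentation independently with probability p."""
    for fn in (drop, cut_out, scale, add_sine):
        if rng.random() < p:
            signal = fn(signal, rng)
    return signal


rng = np.random.default_rng(0)
ecg = rng.normal(size=1000)  # stand-in for a standardized ECG signal
augmented = augment(ecg, rng)
```
\end{verbatim}

Because every augmentation is drawn independently, a single signal may receive none, one, or several augmentations in one pass.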

\subsection{Implementation Details} \label{subsec:implementation}

ECG-DualNet is implemented in PyTorch \cite{Paszke2019}. The preprocessing additionally uses SciPy \cite{Virtanen2020}, NumPy \cite{Harris2020}, and Torchaudio. For the Pad\'{e} Activation Unit, we used the implementation by the authors \cite{Molina2020}.\\
\indent Different network sizes (S, M, L, and XL) for both ECG-DualNet and ECG-DualNet++ (Tab. \ref{tab:results}) were employed. A detailed overview of the network configurations can be found in the appendix or in the provided implementation.\\
\indent Training was performed for $100$ epochs with a fixed batch size of $24$. All models except ECG-DualNet++ $130\si{\mega}$ were trained on a single Nvidia 2080 Ti. Training took between $30\si{\min}$ (ECG-DualNet S) and $3\si{\hour}$ (ECG-DualNet++ XL). Our biggest model, ECG-DualNet++ $130\si{\mega}$, was trained on four Nvidia Tesla V100 ($16\si{\giga\byte}$) GPUs, which took approximately $6\si{\hour}$. The weights of the loss function (Eq. \ref{eq:loss}) were set to $\mathbf{\alpha}=[0.4, 0.7, 0.9, 0.9]$ to counteract the dataset class imbalance (Sec. \ref{subsec:physionet_dataset}). The initial learning rate of the RAdam optimizer was set to $10^{-3}$. The learning rate was decreased by a factor of $0.1$ after $25$, $50$, and $75$ epochs. The first- and second-order momentum factors were set to $0.9$ and $0.999$, respectively. Each augmentation described in Section \ref{subsec:data_augmentation} was applied with a probability of $0.2$. An overview of all hyperparameters is presented in the appendix.\\
\indent We also consider pre-training on the Icentia$11$k dataset \cite{Tan2019}. Similar to pre-training in computer vision \cite{Zeiler2014, Girshick2014, He2019}, we first train on the very large Icentia$11$k dataset. Afterward, the pre-trained weights are used as the initial weights for training on the target PhysioNet dataset. For pre-training, we perform $20$ epochs with a batch size of $100$ on a single Nvidia Tesla V100 ($32\si{\giga\byte}$), which took approximately $24\si{\hour}$. During pre-training, the same learning rate schedule as described above is employed, with steps after $5$, $10$, and $15$ epochs. The subsequent training on the target PhysioNet dataset follows the normal training procedure described above.
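
The step-wise learning rate schedule used for both training and pre-training can be sketched in plain Python as follows. The function name is hypothetical, and the sketch assumes zero-indexed epochs with each decrease taking effect at the start of the milestone epoch. A comparable schedule is available in PyTorch as \texttt{MultiStepLR}; whether the original implementation relies on it is not stated here.

\begin{verbatim}
```python
def learning_rate(epoch, milestones, base_lr=1e-3, factor=0.1):
    """Step schedule: multiply base_lr by factor once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


train_milestones = (25, 50, 75)    # 100-epoch runs on the target dataset
pretrain_milestones = (5, 10, 15)  # 20-epoch pre-training runs

train_lrs = [learning_rate(e, train_milestones) for e in range(100)]
pretrain_lrs = [learning_rate(e, pretrain_milestones) for e in range(20)]
```
\end{verbatim}

Under these assumptions, both schedules start at $10^{-3}$ and end at $10^{-6}$ after the third decrease.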