\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{mathtools}
\usepackage{booktabs}
\usepackage{numprint}
\usepackage{hyperref}
\usepackage{rotating}
\usepackage{graphicx,multirow}

\jmlrvolume{}
\jmlryear{2020}
\jmlrworkshop{Full Paper -- MIDL 2020}
%\jmlrworkshop{Extended Abstract -- MIDL 2019 submission}
%\editors{Under Review for MIDL 2020}

\title[Continual Learning for Chest X-ray]{Continual Learning for Domain Adaptation \\ in Chest X-ray Classification}

%Three or more authors with the same address:
\midlauthor{\Name{Matthias Lenga} \Email{Matthias.Lenga@philips.com}\\
\Name{Heinrich Schulz} \Email{heinrich.schulz@philips.com}\\
\Name{Axel Saalbach} \Email{axel.saalbach@philips.com}\\
\addr  Philips Research Hamburg, R\"ontgenstrasse 24-26, 22335 Hamburg, Germany}


  

\begin{document}
\newcommand{\argmin}{\mathop\text{argmin}}
\newcommand{\argmax}{\mathop\text{argmax}}
\newcommand{\diag}{\mathop\text{diag}}
\newcommand{\dint}{\mathrm{d}}
\newcommand{\E}{\mathbb{E}}
\maketitle

\begin{abstract}
Over the last years, Deep Learning has been successfully applied to a broad range of medical applications.
Especially in the context of chest X-ray classification, results have been reported which are on par with, or even superior to, those of experienced radiologists.
Despite this success in controlled experimental environments, it has been noted that the ability of Deep Learning models to generalize to data from a new domain (with potentially different tasks) is often limited.
In order to address this challenge, we investigate techniques from the field of \emph{Continual Learning} (CL) including Joint Training (JT), Elastic Weight Consolidation (EWC) and Learning Without Forgetting (LWF).
Using the ChestX-ray14 and the MIMIC-CXR datasets, we demonstrate empirically that these methods provide promising options to improve the performance of Deep Learning models on a target domain and to effectively mitigate \emph{catastrophic forgetting} for the source domain. The best overall performance was obtained using JT, while LWF achieved competitive results even without accessing data from the source domain.
\end{abstract}

\begin{keywords}
Convolutional Neural Networks, Continual Learning, Catastrophic Forgetting, Chest X-Ray, ChestX-ray14, MIMIC-CXR, Joint Training, Elastic Weight Consolidation, Learning Without Forgetting.
\end{keywords}

\section{Introduction}
The availability of multiple hospital-scale chest X-ray datasets and the advances in the field of Deep Learning have facilitated the development of techniques for automatic image interpretation.
Using Convolutional Neural Networks (CNNs), performance levels have been reported for multiple findings which are on par with, or even superior to, those of an experienced radiologist.
Following the promising results of CNNs for Pneumonia detection in chest X-rays \citep{rajpurkar2017chexnet}, the success of these methods has been transferred to Cardiomegaly, Edema, and Pleural Effusion \citep{irvin2019chexpert}.
More recently, for all findings in the ChestX-ray14 dataset \citep{wang2017chestx}, a performance similar to radiologists was reported \citep{majkowska2019chest}.
%
%
At the same time, it has been noted that these models can be subject to substantial performance degradations when applied to samples from another dataset or domain \citep{zhang2019mitigating, yao2019strong}.
In single chest X-ray studies, it is commonly expected that the data is independent and identically distributed across the training and test sets, a common assumption in Machine Learning.
In contrast, when comparing different publicly available chest X-ray datasets, a significant domain bias can often be observed. Such differences in the data distribution pose a severe challenge for the development, evaluation and validation of medical devices.
The occurrence of such domain-dependent distribution shifts could be explained, for example, by hospital-specific processes (including machine protocols and treatment policies), the patient population as well as demographic factors \citep{zhang2019mitigating}. Furthermore, the data collection strategy and the employed labeling techniques impact the specific characteristics of a dataset.
%
The development of domain-invariant predictors has received increased interest, including methods based on bias regularized loss functions and domain augmentation \citep{zhang2019mitigating} as well as a simultaneous training on multiple datasets \citep{yao2019strong}.
These approaches have shown great promise to mitigate the effect of a domain shift, but were developed for a one-time optimization prior to model deployment.
On the other hand, for most Deep Learning algorithms it is rather straightforward to implement some basic functionality which allows them to learn in a continuous fashion (even after deployment) and to improve over time. This includes the adaptation to a new domain and new tasks.
This has also been noted by the FDA in a recent discussion about the regulatory implications of modifications to AI/ML-based medical devices, in contrast to ``locked'' software \citep{fda}.

Therefore, in this contribution, we approach this challenge from a different perspective, i.e.\ using methods from the field of Continual Learning (CL).
Traditionally, in Continual Learning, methods are considered for the sequential learning of individual tasks \citep{parisi2018survey}, a concept with great potential for the adaptation of chest X-ray models to a new domain.
However, a fundamental problem in CL is catastrophic forgetting \citep{mccloskey1989catastrophic}, i.e. a phenomenon which is associated with performance degradations for previously learned tasks when a model is adapted to a new task.
For chest X-ray classification, this could result not only in a reduced detection performance for unique findings from the source domain, but the model could unlearn to classify data from the source domain in general.
%
%
Baseline techniques such as Joint Training (JT) try to alleviate this problem by integrating data from the source domain into the learning process, an approach which is not always feasible when sensitive healthcare information is considered.
Regularization-based CL techniques, such as Elastic Weight Consolidation (EWC) \citep{kirkpatrick2017overcoming} and Learning Without Forgetting (LWF) \citep{li2017lwf}, introduce prior information or soft-targets in order to avoid the need for memorizing old data.
In order to evaluate the feasibility of CL techniques, we assess their performance in an empirical study using the ChestX-ray14 and the MIMIC-CXR \citep{johnson2019mimic} dataset.



\section{Material and Methods}
\label{sec:material_and_methods}
In this section we provide a brief summary of the three Continual Learning concepts JT, EWC and LWF.
The latter two methods follow simple regularization paradigms and do not require the storing of training data from previous tasks or domains.
All methods are easy to implement and do not entail a large computational overhead compared to the original model training.
%
%
Following the conventions from the CL literature, we use a task-centered formalism to describe the CL methods. In our chest X-ray scenario described above, the first and second  task correspond to solving the ChestX-ray14 and MIMIC-CXR classification problem, respectively.
%
%
In the rapidly growing field of CL, other methods have been proposed which rely, for example, on episodic memory, generative models or architectural changes of the network.
For a broader overview we refer the interested reader to \cite{parisi2018survey}, \cite{caruana1997mtl} and the references therein.



\subsection{Joint training (JT)}\label{sub:JT}
Suppose that $T_i = (x_{i,j}, y_{i,j})_{j = 1,\dots,N_i}$, $i \in I$, is a sequence of tasks where $x_{i,j}$ denotes the $j$-th sample of task $i$ and $y_{i,j}$ denotes the corresponding label.
A neural network with weight vector $\theta$ is used to model the predictive distribution $p(y\vert \theta, x)$ of unobserved labels $y$ associated to observed samples $x$.
The model fit is typically conducted by empirical risk minimization. Hence, for each individual task $T_i$, the task-specific optimal weight vector $\theta_i$ is obtained by solving a minimization problem of the type
%
%
%
\begin{equation}\label{eq:empirical_risk}
\theta_i = \argmin_\theta L(\theta, T_i) := \argmin_\theta \sum_{j = 1,..., N_i} -\log p(y_{i,j} \vert \theta, x_{i,j}).
\end{equation}
%
%
%
A joint training strategy (JT) aims at improving the model performance on different tasks simultaneously by combining the task-specific training datasets. For example,
given a subset of tasks $J \subset I $ the optimal weight vector $\theta_J$ on the combined task $T_J := \cup_{i\in J}\, T_i$ is obtained by solving a minimization problem of the type
%
%
%
\begin{equation}
\theta_J = \argmin_\theta L(\theta, T_J)
:= \argmin_\theta \sum_{i\in J} \sum_{j = 1,..., N_i} -\log p(y_{i,j} \vert \theta, x_{i,j})
\end{equation}
%
%
%
allowing the learning process to exploit commonalities and differences across different tasks. This can improve the predictive performance when compared to training multiple task-specific models separately or training a single model in a simple sequential fashion which is prone to catastrophic forgetting, cf. \cite{caruana1997mtl}.
%
%
Unfortunately, in many real world scenarios, the aggregation of large heterogeneous training datasets (e.g. for chest X-ray classification) is subject to various limitations. In particular, task-specific data used for model training may no longer be available at some future time point when data associated to a new task is obtained and model fine-tuning becomes necessary.
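
As a brief worked remark (a sketch, assuming that the task datasets $T_i$, $i \in J$, are pairwise disjoint so that no sample is counted twice), the joint objective decomposes into the task-specific objectives from \eqref{eq:empirical_risk}:
\begin{equation*}
L(\theta, T_J) = \sum_{i\in J} L(\theta, T_i).
\end{equation*}
Since the losses are unnormalized sums over samples, each task contributes in proportion to its sample count $N_i$, so larger datasets carry more weight in the joint fit.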




\subsection{Elastic Weight Consolidation (EWC)}\label{sub:ewc}
Various CL approaches for explicitly modeling cross-correlations between distinct tasks have been proposed. Elastic weight consolidation assumes a prior distribution $p(\theta \vert T_{i-1})$ on the network weights $\theta$ during the model adaptation for task $T_{i}$. The prior $p(\theta \vert T_{i-1})$ is selected in such a way that it captures basic statistical properties of the empirical distribution of the network weights across the previous task $T_{i-1}$.  Finally, the optimal parameter for the current task $T_i$ is obtained as the maximum a posteriori estimate
%
%
%
\begin{equation}
\theta_i
:= \argmax_\theta \sum_{j = 1,..., N_i} \log p(\theta \vert x_{i,j},y_{i,j} )
= \argmin_\theta  L(\theta, T_i) - N_i\log p(\theta \vert T_{i-1} ).
\end{equation}
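
The second equality can be traced as follows. This is a sketch under the assumption that the prior on $\theta$ does not depend on the inputs $x_{i,j}$ and is given by $p(\theta \vert T_{i-1})$. By Bayes' rule,
\begin{equation*}
\log p(\theta \vert x_{i,j}, y_{i,j})
= \log p(y_{i,j} \vert \theta, x_{i,j}) + \log p(\theta \vert T_{i-1}) - \log p(y_{i,j} \vert x_{i,j}),
\end{equation*}
where the last term does not depend on $\theta$. Summing over $j = 1,\dots,N_i$ gives
\begin{equation*}
\sum_{j = 1}^{N_i} \log p(\theta \vert x_{i,j}, y_{i,j})
= -L(\theta, T_i) + N_i \log p(\theta \vert T_{i-1}) + \text{const},
\end{equation*}
so that maximizing the left-hand side over $\theta$ is equivalent to minimizing $L(\theta, T_i) - N_i \log p(\theta \vert T_{i-1})$.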
%
%
%
In contrast to memory-based methods, EWC acts as a simple regularizer on the training objective and does not rely on storing any additional data associated to previous tasks. The key assumption of EWC is that enough information about previous tasks can be encoded within the model weight prior distribution in order to prevent a severe performance degradation when moving to a new task.
Owing to their computational tractability, frequent choices for $p(\theta \vert T_{i-1})$ are multivariate Laplace or Gaussian distributions. In the Gaussian case $p(\theta \vert T_{i-1}) = \mathcal{N}(\theta \vert \mu_{i-1}, \Sigma_{i-1})$ we obtain
%
%
%
\begin{equation}\label{eq:ewc_objective}
\theta_i = \argmin_\theta  L(\theta, T_i) + \lambda(\theta - \mu_{i-1})^\top \Sigma_{i-1}^{-1} (\theta - \mu_{i-1})
\end{equation}
%
%
%
with a constant $\lambda>0$ which regulates the impact of the prior. Choosing the parameters $\mu_{i-1} = \theta_{i-1}$ and $\Sigma_{i-1}^{-1}  = \diag(F_{i-1})$, where $F_{i-1}$ denotes the empirical Fisher matrix associated to task $T_{i-1}$, i.e.
%
%
%
\begin{equation}\label{eq:empirical_hessian}
F_{i-1}
:= \frac{1}{N_{i-1}} \sum_{j=1,...,N_{i-1}} \nabla_\theta \log p(y_{i-1,j} \vert \theta_{i-1}, x_{i-1,j}) \, \nabla_\theta \log p(y_{i-1,j} \vert \theta_{i-1}, x_{i-1,j})^\top,
\end{equation}
%
%
%
yields the EWC objective from \citep{kirkpatrick2017overcoming}.
It is well known that under mild regularity assumptions \eqref{eq:empirical_hessian} constitutes an approximation to the empirical Hessian of the negative log-likelihood (NLL) with respect to $\theta$, i.e.
%
%
%
\begin{equation}
\E_{y\sim p\vert \theta, x}\, H_\theta [ -\log p(y \vert \theta, x) ]
= \E_{y\sim p\vert \theta, x}\, \nabla_\theta \log p(y\vert \theta, x) \, \nabla_\theta \log p(y \vert \theta, x)^\top
\end{equation}
%
%
%
holds true. Consequently, the entries of $\diag(F_{i-1})$
may be considered as approximations to the non-mixed second derivatives of the NLL, which reflect to some extent the sensitivity of the model output with respect to marginal changes in the network weights. As argued in \citep{kirkpatrick2017overcoming}, second derivatives of large magnitude indicate a high importance of the corresponding model parameter for solving the task $T_{i-1}$. Consequently, the quadratic penalty term in \eqref{eq:ewc_objective} discourages strong deviations from the previous task's parameter $\theta_{i-1}$ in the sensitive weight space directions.
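
Written out per coordinate, the penalty takes a particularly simple form. As a sketch, let $k$ index the individual network weights and let $F_{i-1}^{(k,k)}$ denote the $k$-th main diagonal entry of $F_{i-1}$ (both are illustrative notation introduced here). With $\mu_{i-1} = \theta_{i-1}$ and $\Sigma_{i-1}^{-1} = \diag(F_{i-1})$, the penalty term in \eqref{eq:ewc_objective} reads
\begin{equation*}
\lambda\, (\theta - \theta_{i-1})^\top \diag(F_{i-1})\, (\theta - \theta_{i-1})
= \lambda \sum_{k} F_{i-1}^{(k,k)} \bigl(\theta^{(k)} - \theta_{i-1}^{(k)}\bigr)^2 .
\end{equation*}
Each weight is thus pulled towards its previous value with an individual strength proportional to its estimated sensitivity, and a weight with $F_{i-1}^{(k,k)} = 0$ is not constrained at all.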
%
%
%
%
In summary, by imposing a prior $p(\theta \vert T_{i-1})$ on the model weights, deviations from $\theta_{i-1}$ are penalized while learning the task $T_{i}$ parameter $\theta_i$.
The magnitude of the penalty depends on the choice of the prior. For example, prior distributions which are highly concentrated at $\theta_{i-1}$ may severely constrain the flexibility of the model to adapt to the new task $T_i$ in favor of preserving the model performance on $T_{i-1}$. Elastic weight consolidation acts as a regularizer for the current task's model weights and does not require storing the training data from previous tasks.



\subsection{Learning Without Forgetting (LWF)}\label{sub:lwf}
The key idea of the Learning Without Forgetting method is to
introduce a soft-target regularization into the training loss associated to the current task which reflects the behavior of the model associated to the previous task on the dataset at hand.

In more detail: When moving to a new task
$T_i = (x_{i,j}, y_{i,j})_{j = 1,...,N_i}$ we apply the previous
model $M_{\theta_{i-1}}$ which was trained on $T_{i-1}$ to the current task's training samples $x_{i,j}$ in order to generate ``synthetic labels'' $\hat{y}_{i,j} := M_{\theta_{i-1}}(x_{i,j})$ which record the model behavior. Please note that the raw model outputs $\hat{y}_{i,j}$ correspond, depending on the implementation, to float-valued tensors rather than integer class assignments.
By adding a regularization term to the loss functional  \eqref{eq:empirical_risk}, a bias towards a consistent behavior of the models $M_{\theta_{i}}$ and $M_{\theta_{i-1}}$ on the current task's training samples is introduced.
The task $T_i$ optimal model weight $\theta_i$ is then obtained by solving a minimization problem of the type
%
%
%
\begin{equation}\label{eq:lwf_loss}
\theta_i = \argmin_\theta
\sum_{j = 1,..., N_i} -\log p(y_{i,j} \vert \theta, x_{i,j}) - \lambda \log p(\hat{y}_{i,j} \vert \theta, x_{i,j}).
\end{equation}
%
%
%
Increasing the parameter $\lambda > 0$ decreases the relevance of the ``hard-labels'' $y_{i,j}$ associated to $T_i$ and instead rewards model output patterns which are consistent with the previous model.
For a detailed discussion of LWF in the classification setting we refer the reader to \cite{li2017lwf}.
This basic concept can be implemented and extended in various ways. For example, in the classification setting the soft-target concept can be used to fill missing labels when fine-tuning a model on a new dataset where only partial annotations are available.
Similar to EWC, this approach acts as a mere regularizer for the current task's model weights. Access to the previous task's training data is not required.
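
For a multi-label setting with sigmoid outputs, one possible reading of the soft-target term in \eqref{eq:lwf_loss} is a cross entropy against the float-valued outputs of the previous model. As a sketch, assume $C$ labels indexed by $c$ (illustrative notation), previous-model outputs $\hat{y}_{i,j}^{(c)} \in [0,1]$ and current-model outputs $M_{\theta}(x_{i,j})^{(c)} \in (0,1)$. The term may then be interpreted as
\begin{equation*}
-\log p(\hat{y}_{i,j} \vert \theta, x_{i,j})
= -\sum_{c=1}^{C} \Bigl[ \hat{y}_{i,j}^{(c)} \log M_{\theta}(x_{i,j})^{(c)}
+ \bigl(1 - \hat{y}_{i,j}^{(c)}\bigr) \log \bigl(1 - M_{\theta}(x_{i,j})^{(c)}\bigr) \Bigr].
\end{equation*}
For fixed soft targets, each summand is minimized when $M_{\theta}(x_{i,j})^{(c)} = \hat{y}_{i,j}^{(c)}$, i.e. when the new model reproduces the output of the previous model. The exact form used in a given implementation may differ, e.g. by a temperature scaling of the outputs as in \cite{li2017lwf}.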



\subsection{Datasets}
In the following we consider the datasets ChestX-ray14 \citep{wang2017chestx} and MIMIC-CXR \citep{johnson2019mimic}.
The ChestX-ray14 data was released in 2017 by the NIH Clinical Center and consists of 112120 chest X-ray images (AP/PA) from 30805 patients. The images in the dataset were annotated with respect to 14 different findings using an NLP-based analysis of the radiology reports (with an additional ``No Findings'' label which is typically not considered).


The MIMIC-CXR dataset (consortium version v2.0.0) consists of X-ray images (DICOM) and radiology reports from the Beth Israel Deaconess Medical Center in Boston. For model training and evaluation, we filtered the DICOM data for AP/PA chest X-ray images (based on the DICOM attributes ImageType, PresentationIntentType, PhotometricInterpretation, BodyPartExamined, ViewPosition and PatientOrientation) resulting in a dataset with 226483 images from 62568 patients.
%
%
In order to generate annotations, we applied the CheXpert labeler to the impression section of the reports, yielding annotations for 13 findings and a ``No Finding'' label \citep{irvin2019chexpert}\footnote{For convenience we adopted the U-Zeroes approach from \citep{irvin2019chexpert} for uncertain labels.}.
In contrast to the ChestX-ray14 dataset, no official train/test split is available for MIMIC-CXR.
Therefore, we randomly selected 80\% of the patients for training while the remaining 20\% were assigned to the test split.
For the following experiment, it is assumed that matching labels (including ``Effusion'' and ``Pleural Effusion'') represent comparable concepts in both datasets.
Consequently, we consider in total 21 labels with 7 unique findings for each dataset and 7 findings occurring in both datasets, see Table \ref{tab:results}.
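
The label count follows from a simple inclusion-exclusion argument. Writing $A$ and $B$ (symbols introduced here for illustration) for the sets of 14 labels considered for ChestX-ray14 and MIMIC-CXR, respectively, and using that 7 labels are shared, we obtain
\begin{equation*}
\vert A \cup B \vert = \vert A \vert + \vert B \vert - \vert A \cap B \vert = 14 + 14 - 7 = 21,
\qquad
\vert A \setminus B \vert = \vert B \setminus A \vert = 14 - 7 = 7 .
\end{equation*}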





\subsection{Experimental Design}\label{sec:experimental_design}
In order to investigate the impact of a domain shift in the data distribution and the potential benefit of the CL methods outlined in Sections~\ref{sub:JT}, \ref{sub:ewc} and \ref{sub:lwf}, a set of networks was adapted first to ChestX-ray14 and subsequently to MIMIC-CXR.
To this end, a pre-trained DenseNet121 \citep{huang2017densely} was selected as a starting point as it is one of the most commonly employed neural network types in the X-ray domain.
%
%
In order to account for the changed number of labels and the multi-label classification task, the last layer was replaced by a randomly initialized linear layer and a sigmoid activation function.
%
%
For the first and second adaptation step a similar hyper-parameter setup was employed:
Binary cross entropy was used as a loss function, while for all training scenarios except LWF the computation of the loss (training and validation) was restricted to the labels from the current domain.
Stochastic gradient descent with momentum was used as update rule, with an initial learning rate of 0.01, a momentum of 0.9 and a mini-batch size of 16.
%
%
For the adaptation to ChestX-ray14 an $\text{L}_2$ weight decay of 0.0001 was employed, whereas for the MIMIC-CXR task, weight decay was disabled.
%
%
After each epoch, the learning rate was reduced by a factor of 10 if the validation loss did not improve.
During the training, the images in a mini-batch were subject to data augmentation with a probability of 90\%. Our data augmentation included common strategies such as scaling ($\pm 15\%$), rotation around the image center ($\pm 5^\circ$), translation relative to the image extent ($\pm 10\%$) as well as mirroring along the midsagittal plane ($50\%$ chance).
Finally, all images were rescaled to $224\times224$ pixels in order to match the input size of the DenseNet121 architecture.
After training, the network with the lowest validation loss was used for the processing of the test dataset. All experiments were repeated 5 times with resampled validation sets (using $10\%$ of all patients).

The ChestX-ray14 model was adapted on the MIMIC-CXR dataset using four different training strategies:
%
%
\begin{enumerate}
\item A standard fine-tuning of the networks using the MIMIC-CXR data only.
%
%
\item A JT setup where $20\%,\dots,100\%$ of the ChestX-ray14 data was included into the adaptation process in addition to the MIMIC-CXR data, cf. Section \ref{sub:JT}.
%
%
\item Fine-tuning on the MIMIC-CXR data using EWC regularization with a Gaussian prior distribution on the model weights and an impact of $\lambda=0.001$, cf. Section \ref{sub:ewc}. \\
%
For each fold, the mean and the covariance matrix of the prior was calculated based on the associated final model trained on the  ChestX-ray14 data.
The parameter vector of the ChestX-ray14 model was selected as mean $\mu_{i-1}$ in the EWC objective \eqref{eq:ewc_objective}.
As inverse covariance matrix we chose the binarized diagonal of the empirical Fisher matrix \eqref{eq:empirical_hessian} calculated over all ChestX-ray14 training samples with a sensitivity threshold of $\rho = 0.001$.
%
%
That is to say, we chose $\Sigma_{i-1}^{-1}  =  \diag(F_{i-1} > \rho)$, where $F_{i-1}$ is defined as in Equation \eqref{eq:empirical_hessian}. Consequently, all network parameters with a sensitivity below $\rho$ are not affected by the regularization. All other parameters are shrunk towards $\mu_{i-1}$ uniformly with the  rate $\lambda$. We found it useful to select $\rho$ based on the distribution of the main diagonal entries of $F_{i-1}$. For example, setting $\rho$ to the 95\%-quantile imposes a uniform regularization on the 5\% most sensitive network weights. The intuition behind this binarized EWC version is rather simple: we decompose the weight space of the neural network into a subspace containing the sensitive dimensions and its complement. Then a uniform $\text{L}_2$-regularization is applied to the weight vector projected on the ``sensitive'' subspace. Clearly, the computational overhead of this binary EWC is lower compared to classic EWC.
%
%
\item Fine-tuning on the MIMIC-CXR data using LWF regularization with an impact parameter $\lambda = 2.0$, cf. Section \ref{sub:lwf}. 
To facilitate the adaptation of the model to the new domain, we applied the LWF penalty only to the 7 labels \emph{not} present in the MIMIC-CXR dataset.
Consequently, in the LWF setting all 21 labels from both domains are considered, whereas for EWC and JT-0\% only labels from MIMIC-CXR are taken into account.
However, the validation loss is always computed on the domain specific validation data containing only 14 labels.
\end{enumerate}
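
The binarized EWC penalty used in the third training strategy can be written out per coordinate. As a sketch, let $k$ index the network weights, let $F_{i-1}^{(k,k)}$ denote the $k$-th main diagonal entry of $F_{i-1}$, let $\mathbf{1}[\cdot]$ be the indicator function and let $P_\rho$ be the coordinate projection onto the weights with $F_{i-1}^{(k,k)} > \rho$ (all of this is illustrative notation introduced here). Then
\begin{equation*}
\lambda\, (\theta - \mu_{i-1})^\top \diag(F_{i-1} > \rho)\, (\theta - \mu_{i-1})
= \lambda \sum_{k} \mathbf{1}\bigl[F_{i-1}^{(k,k)} > \rho\bigr] \bigl(\theta^{(k)} - \mu_{i-1}^{(k)}\bigr)^2
= \lambda\, \bigl\Vert P_\rho (\theta - \mu_{i-1}) \bigr\Vert_2^2 .
\end{equation*}
In this form only a binary mask over the weights has to be kept alongside $\mu_{i-1}$, rather than one floating-point sensitivity value per weight.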






\section{Results}
\label{sec:results}
Our quantitative results in terms of average AUC values for each finding along with their standard deviations are summarized in Table \ref{tab:results}. In the upper row the model performance on the ChestX-ray14 dataset is given, while the bottom row corresponds to the performance on the MIMIC-CXR dataset. The left column (Initial) indicates the performance after an initial training on ChestX-ray14, whereas the right columns (JT-0\%, JT-20\%, \dots, LWF) contain the results after the model adaptation to MIMIC-CXR.
When applying the models trained on ChestX-ray14 directly to the MIMIC-CXR data, a decreased performance for the classes Cardiomegaly, Edema, Pneumonia and Pneumothorax can be observed. This indicates that the source domain training data is not representative enough for the target domain data distribution. The strongest decrease is observed for Cardiomegaly with a drop from 0.8806 to 0.7603 mean AUC.
For the classes Atelectasis, Consolidation and Effusion the performance on the target domain is comparable or even slightly superior, see lower left quadrant of Table \ref{tab:results}. As a consequence of the domain shift, the average AUC across all labels decreases from 0.8106 to 0.7833 making model adaptation unavoidable. The lower right quadrant shows that all CL methods achieve a formidable on-domain model performance on the MIMIC-CXR data with average AUC values across all findings ranging from 0.8190 to 0.8257. In particular, this indicates that both regularization approaches (LWF and EWC) still allow for enough flexibility that the model can adjust to the new domain.

However, a simple adaptation to MIMIC-CXR with no CL strategy (JT-0\%) leads to a decrease of the mean AUCs on the ChestX-ray14 domain for all classes except Infiltration, Pneumonia and Pneumothorax.
The effect of catastrophic forgetting becomes more evident in Figure \ref{fig:results}, which depicts the (averaged) Forward Transfer (FWT) and Backward Transfer (BWT) for all findings. These concepts were introduced by \cite{lopez2017gradient} in order to measure the knowledge transfer across a sequence of tasks.\footnote{In contrast to \cite{lopez2017gradient} we employed an AUC-based variant of these measures.}
%
%
The BWT measures the changes of model performance on a task $T_{i}$ after adapting to a new task $T_{i+1}$. In detail, for each individual label the BWT is computed by subtracting the task $T_{i}$ AUC value obtained prior to adapting the model to $T_{i+1}$ from the task $T_{i}$ AUC value obtained after the adaptation. A negative BWT is often associated with catastrophic forgetting. Conversely, a positive BWT is obtained if the performance on the previous task is increased.
%
%
Similarly, the FWT measures the effect of learning a task $T_{i}$ on the performance of a future task $T_{i+1}$ which was not seen during training. In detail, for each individual label the FWT is computed by subtracting $0.5$ (AUC of random classifier) from the task $T_{i+1}$ AUC values (without adapting the task $T_{i}$ model to $T_{i+1}$).
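
In formulas, reading the BWT as the change in performance on $T_i$ caused by the adaptation, both measures can be sketched as follows. The notation $\mathrm{AUC}_c(M, T)$ for the AUC of a model $M$ for label $c$ on the test data of task $T$ is introduced here for illustration only:
\begin{equation*}
\mathrm{BWT}_c = \mathrm{AUC}_c(M_{\theta_{i+1}}, T_i) - \mathrm{AUC}_c(M_{\theta_{i}}, T_i),
\qquad
\mathrm{FWT}_c = \mathrm{AUC}_c(M_{\theta_{i}}, T_{i+1}) - 0.5 .
\end{equation*}
For instance, the mean Cardiomegaly AUC of $0.7603$ reported above for the ChestX-ray14 models on MIMIC-CXR corresponds to $\mathrm{FWT} = 0.7603 - 0.5 = 0.2603$.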
%
%
%
\begin{figure}
\centering
\includegraphics[trim={0 0.3cm 0 0.35cm},clip,width=0.95\linewidth]{FWTBWT.pdf}
\caption{Left: Backward Transfer (BWT) on ChestX-ray14 after adaptation using different Continual Learning (CL) techniques. Right: Forward Transfer (FWT) for a ChestX-ray14 model on MIMIC-CXR. Bars indicate min, mean and max.}
\label{fig:results}
\end{figure}
%
%
%
%
While the ChestX-ray14 models achieve a moderate FWT on MIMIC-CXR, the low BWT indicates a considerable drop in performance on ChestX-ray14 after the adaptation (JT-0\%).
Integrating data from ChestX-ray14 into the training on the new domain mitigates this effect (JT-20\%,\dots, JT-100\%).
We observe that the BWT is positively correlated with the number of additional samples from ChestX-ray14.
Not surprisingly, the best model performance is achieved on the combined dataset containing all training samples from both domains (JT-100\%).
As argued above, in real world scenarios access to old training data might be limited or not possible at all.
Consequently, the regularization-based methods LWF and EWC, which do not rely on storing data from previous tasks or domains, are of high practical relevance.
In our experiments, LWF outperformed the EWC approach and achieved a performance on the original domain between JT-60\% and JT-80\% (and superior to the original model) without accessing any data from ChestX-ray14. 





\section{Conclusion}
\label{sec:conclusion}
In this paper we investigated the applicability of different Continual Learning methods  for domain adaptation in chest X-ray classification.
To that end, a DenseNet121 was trained on ChestX-ray14 and subsequently fine-tuned on MIMIC-CXR using different Continual Learning strategies (JT, EWC, LWF) in order to adapt to the new domain without severe performance degradations on the original data. The motivation for choosing these datasets as distinct domains was to simulate a realistic domain shift as encountered in clinical practice.
Our quantitative evaluation, including the measurement of Backward and Forward Transfer, confirmed that employing these methods indeed improves the overall model performance, compared to a simple continuation of the model training on the new domain.
The best performance was achieved by JT-100\%, i.e. training the model on the entire combined datasets from both domains. However, in real world scenarios, e.g. adapting models which are already deployed in the clinic, it is questionable for legal and privacy reasons whether the data used for training the original model is always accessible. Hence, the EWC and LWF methods, which do not rely on old training samples, are of high practical relevance. Our experiments indicate that these regularization techniques indeed allow a model adaptation to the target domain while preserving a performance on the original domain which is still close to the JT baseline.





% RESULT TABLE
\begin{sidewaystable}
 \centering
{\begin{scriptsize}
\include{auc}
\end{scriptsize}}
\caption{Model performance measured in mean AUC and standard deviation on ChestX-ray14 and MIMIC-CXR. Classes marked with a star $(\star)$ appear in both datasets.
\textbf{Upper and lower left quadrant}: mean AUCs related to model trained on ChestX-ray14. On-domain performance (ChestX-ray14 model on ChestX-ray14) in upper left, performance on unseen target domain (ChestX-ray14 model on MIMIC-CXR) in lower left.
\textbf{Upper and lower right quadrant}: mean AUCs related to fine-tuned ChestX-ray14 model on the MIMIC-CXR domain using different Continual Learning strategies: simple fine-tuning without additional data (JT-0\%), JT with 20\%--100\% inclusion of ChestX-ray14 data, EWC and LWF.
On-domain performance (fine-tuned MIMIC-CXR model on MIMIC-CXR) in lower right, performance of original domain (fine-tuned MIMIC-CXR model on ChestX-ray14) in upper right.}
\label{tab:results}
\end{sidewaystable}

\newpage
\bibliography{Lenga20}



\end{document}
