\documentclass{midl}

\jmlryear{2021}
\jmlrworkshop{Full Paper -- MIDL 2021}

\title[Intensity Correction and Standardization for Electron Microscopy Data]{Intensity Correction and Standardization for Electron Microscopy Data}

\usepackage{graphicx}
\usepackage{nicefrac}
\usepackage{textgreek}

\allowdisplaybreaks

\graphicspath{{./Figures/}
              {./Figures/Figure1/}
              {./Figures/Figure2/}
              {./Figures/Figure3/}
              {./Figures/Figure4/}
              {./Figures/Figure5/}
              {./Figures/Figure6/}}
\DeclareGraphicsExtensions{.eps}

\def\x{\mathbf{x}}
\def\y{\mathbf{y}}
\def\a{\mathbf{a}}
\def\b{\mathbf{b}}
\def\c{\mathbf{c}}

\newcommand\mystrut{}
\def\mystrut(#1){\vrule height #1pt depth #1pt width 0pt}

\midlauthor{%
\Name{Oleh Dzyubachyk\nametag{$^{1,2}$}} \Email{o.dzyubachyk@lumc.nl}\\
\Name{Roman I. Koning\nametag{$^{1}$}} \Email{r.i.koning@lumc.nl} \\
\Name{Aat A. Mulder\nametag{$^{1}$}} \Email{a.a.mulder@lumc.nl}\\
\Name{M. Cristina Avramut\nametag{$^{1}$}} \Email{m.c.avramut@lumc.nl}\\
\Name{Frank G.A. Faas\nametag{$^{1}$}} \Email{f.g.a.faas@lumc.nl}\\
\Name{Abraham J. Koster\nametag{$^{1}$}} \Email{a.j.koster@lumc.nl}\\
\addr $^{1}$ Section Electron Microscopy, Department of Cell and Chemical Biology, Leiden University Medical Center, Leiden, the Netherlands \\
\addr $^{2}$ Division of Image Processing, Department of Radiology, Leiden University Medical Center, Leiden, the Netherlands \AND
}


\begin{document}

\maketitle

\begin{abstract}
The intensity of acquired electron microscopy data is subject to large variability due to the interplay of many different factors, such as microscope and camera settings used for data acquisition, sample thickness, specimen staining protocol and more. In this work, we developed an efficient method for performing intensity inhomogeneity correction on a single set of combined transmission electron microscopy (TEM) images and demonstrated its positive impact on training a neural network on these data. In addition, we investigated what impact different intensity standardization methods have on the training performance, both for data originating from a single source and for data from several different sources. As~a concrete example, we considered the problem of segmenting mitochondria from EM data and demonstrated that we were able to obtain promising results when training our network on a large array of highly variable in-house TEM data.
\end{abstract}

\begin{keywords}
{Transmission electron microscopy, Mitochondria segmentation, Intensity correction, Intensity standardization}
\end{keywords}

\section{Introduction}
\label{sec:intro}

Even though quantification of electron microscopy (EM) data has received a significant boost with the advance of machine-learning techniques, the number of publications on this type of imaging still lags far behind that for other modalities. In particular, the work of \citet{bib:Lucchi2013} was one of the first publications on machine-learning-based segmentation of mitochondria from EM data that received significant attention in the community. The data that were made publicly available by the authors have become a benchmark for virtually all subsequent studies on this topic. Other publications that attracted considerable attention in this area are the works of \citet{bib:Haberl2018} and \citet{bib:Xiao2018}. In particular, the latter publication described an elaborate network design for segmentation of mitochondria from EM data that served as a basis for follow-up publications of other groups \cite{bib:Casser20a}.

While network design has received significant attention in the literature, the problem of standardization (harmonization) of EM data remains largely unaddressed. The only type of data preprocessing we were able to find in the related publications was histogram equalization \cite{bib:Xiao2018}, but even in this case no further implementation details were provided. This fact might be attributed to the rather small size of the data on which these methods were trained and tested. However, our in-house data set is much larger: hundreds of data sets, each consisting of several hundred or even several thousand separate frames. This inevitably leads to significant variability of the intensity distribution of these images, which, in combination with large diversity in the appearance of the mitochondria themselves due to biological variations and staining properties (see Section~\ref{sec:data}), renders quantification of such data extremely challenging.

The aim of this research is to develop targeted preprocessing methods for EM data, in particular, intensity correction and standardization, with the final goal of developing a machine-learning-based segmentation approach for processing a wide variety of EM images. As a concrete example, in this work we considered the problem of automated segmentation of mitochondria. However, we expect the performed analysis and developed approaches to be generic and to have capacity to be extended to a wide range of similar problems, not limited to EM data. In the remainder of this manuscript, we present our experiments and draw several important conclusions from the results.


\section{Data}
\label{sec:data}

For this analysis, we selected several image data sets from our in-house data acquired as part of the same project. The images were acquired with a digital charge-coupled device (CCD) camera (One View, Gatan Inc., Pleasanton, USA) mounted on a Tecnai 12 TWIN transmission electron microscope (FEI, Eindhoven, the Netherlands) operating at 120~kV. CCD images of fixed and positively stained samples \cite{bib:Giacomelli2020} were collected with binning 2 and an overlap of 20\% and stitched together into one large image mosaic, as described previously \cite{bib:Faas2012}. The samples for acquiring all the data analyzed in this project originated from the kidney tissue of 3 different individuals (donors) and were imaged at a single time point (one donor) or at two different time points (two donors).

Each image mosaic typically consists of several hundred separate frames, $2048\times2048$ pixels large (pixel size = 3.35~nm$^2$). Three experts on this type of EM images selected regions of interest (ROIs) from five mosaics and manually segmented mitochondria on the original frames belonging to these ROIs using custom in-house annotation software. Each data set was annotated by one expert. Statistics on annotated frames and the total number of annotated objects are provided in Table~\ref{tbl:data_stats}. A typical frame from each data set and the corresponding mitochondria annotations are shown in Figure~\ref{fig:SampleImage}. These images confirm significant variability of mitochondria appearance with respect to size, shape and intensity distribution. As we are using exactly the same preparation and acquisition protocol for all these data sets, the observed differences are entirely caused by the underlying biological factors. Quantitative analysis of these differences is the main objective of a larger research project, with this work being the first step towards achieving this goal.
A contiguous region of $10\times10$ frames was additionally selected from each of the annotated data sets, close to the location of the annotated ROI, for developing and testing the intensity inhomogeneity correction algorithm.

\begin{table}[!tb]
\begin{center}
\renewcommand{\arraystretch}{1.2}
\caption{Number of annotated frames and mitochondria from each selected data set}
\label{tbl:data_stats}
\small
\begin{tabular}{p{20mm}|p{15mm}|p{15mm}|p{15mm}|p{15mm}|p{15mm}}
  \hline
                & \centering{2922Q1} & \centering{2922Q4} & \centering{2929L4} & \centering{2929Q1} & \centering{2929Q4} \tabularnewline
  \hline
  Frames        & \centering{20} & \centering{44} & \centering{58} & \centering{251} & \centering{55} \tabularnewline
  Mitochondria  & \centering{222} & \centering{598} & \centering{1927} & \centering{2764} & \centering{745} \tabularnewline
  \hline
\end{tabular}
\end{center}
\end{table}

\begin{figure*}[!t]
\includegraphics[width=\textwidth]{Figure1}
\centering
\caption{\small Sample frame from each data set (top) and the corresponding annotation (bottom). Contrast of the images in the top row was increased for visualization purposes. The~length of the scale bar is 1~\textmu{m}.}
\label{fig:SampleImage}
\end{figure*}


\section{Experiments and Results}
\label{sec:results}

In this section, we present the setups and results of three experiments designed to assess data quality improvement after applying intensity correction and standardization to the selected EM data. The first experiment was executed on the non-annotated data originating from the five data sets, and the remaining two experiments were executed on the annotated data. In the remainder of this manuscript, we will use the following abbreviations for different intensity correction and standardization methods: ``B'' = bias correction, ``H'' = histogram equalization, ``M'' = histogram mapping.

To train a deep convolutional network, we adopted the design of \citet{bib:Xiao2018}. However, our architecture was simpler as we did not use some of the advanced features, such as auxiliary outputs or augmentation during the test phase. The dropout rate was set to 0.1, the batch size was set to 4, and we used geometric data augmentation (flipping, rotation by 90$^\circ$, 180$^\circ$ and 270$^\circ$) during the training phase. The network was trained for 50 epochs, using the weighted sum of the binary cross-entropy and the Jaccard index as the loss function. All frames were downsampled to the size of $256\times256$ pixels prior to training. For each data set, 15\% of the annotated frames were reserved for testing, 15\% for validation, and the remaining 70\% were used for training. For each repetition of the training experiment, we generated a separate data split using a different random seed. Subsequently, we processed these data with each of the described preprocessing methods in turn.
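
As an illustration, one common way of combining these two terms is sketched below. The mixing weight $\lambda$, the soft (differentiable) form of the Jaccard term, and the symbols $p_k$ and $g_k$ (predicted probability and ground-truth label of pixel $k$ out of $M$ pixels) are assumptions introduced here; the exact weighting is not specified in this description:
\begin{align*}
\mathcal{L} &= \lambda\,\mathcal{L}_{\mathrm{BCE}} + (1-\lambda)\left(1 - \frac{\sum_{k} p_k g_k}{\sum_{k} p_k + \sum_{k} g_k - \sum_{k} p_k g_k}\right),\\
\mathcal{L}_{\mathrm{BCE}} &= -\frac{1}{M}\sum_{k=1}^{M}\left[\mystrut(5) g_k\log p_k + (1-g_k)\log(1-p_k)\right],
\end{align*}
with $0\le\lambda\le1$. In this form, both terms decrease as the predicted segmentation approaches the ground truth.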

\subsection{Intra-set intensity correction}
\label{ssec:intra-stitch}

The goal of this experiment is to develop an approach for reducing intensity variation within a single data set. Such a method should potentially be able to perform both intensity scaling and bias (field inhomogeneity) correction. For this, we extended the Coherent Local Intensity Clustering (CLIC) method \cite{bib:Li2009} that was developed for correcting magnetic resonance data.
To assess the quality of intensity correction, we used information from the corresponding overlapping regions of two neighbouring frames of the five test data sets. More precisely, we selected the absolute difference between the means of the two overlapping regions and the Jeffrey divergence \cite{bib:Jager2009} between the histograms of these regions as our validation measures.
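
In a commonly used form, which may differ in detail from the definition in the cited work, these two measures can be written for a pair of overlapping regions $R_1$ and $R_2$ with normalized intensity histograms $h_1$ and $h_2$ (all four symbols are introduced here for illustration only) as
\begin{align*}
d_{\mu} &= \left|\mystrut(5)\,\overline{I(R_1)} - \overline{I(R_2)}\,\right|,\\
d_{J} &= \sum_{k}\left[\mystrut(5)h_1(k) - h_2(k)\right]\log\frac{h_1(k)}{h_2(k)},
\end{align*}
where the overline denotes the mean intensity over a region and the sum runs over the histogram bins $k$ that are non-empty in both histograms. The second expression is the symmetrized Kullback--Leibler divergence. Both measures vanish when the two regions have identical intensity histograms.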

CLIC \cite{bib:Li2009} is an elegant framework that allows performing intensity inhomogeneity correction based on a very limited set of assumptions: 1)~the intensity content of every frame is modelled as a combination of a finite number of classes, each having a distinct intensity distribution; and 2)~the bias field is smooth. Following their formalism and adding novel linear intensity correction terms (shift $\a$ and scaling $\b$) to the model, we represent every acquired frame $I_t(\x)$ ($t=\overline{1,N}$) as:
\begin{equation*}
I_t(\x) = a_t + b_tB(\x)J_t(\x) + n_t(\x),
\label{eqn:eqn1}
\end{equation*}
where $J_t(\x)$ is the true intensity of the frame; $B(\x)$ is a smooth bias field; $\b=\{b_t\}$ and $\a=\{a_t\}$ are the slope and the intercept of the linear intensity correction function; $n_t(\x)$ denotes additive noise; $N$ is the total number of frames; and $\x$ is a vector of 2D Cartesian coordinates. Note that the bias field is assumed to be the same for all frames as it should represent imperfection of the imaging device. Conversely, the intercept $a_t$ and the slope $b_t$ of the intensity correction function, modelling intensity shift and scaling, respectively, are specific to each frame $I_t(\x)$ and constant within it.
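
Under this model, and neglecting the noise term, an estimate of the true intensity follows by inverting the linear relation. The hat notation is introduced here for illustration, and $b_t B(\x)\neq0$ is assumed:
\begin{equation*}
\hat{J}_t(\x) = \frac{I_t(\x) - a_t}{b_t B(\x)}.
\end{equation*}
Setting $a_t=0$ and $b_t=1$ for all $t$ reduces this expression to pure bias correction, $\hat{J}_t(\x)=I_t(\x)/B(\x)$.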

Denoting the target intensity of each of the three classes observed as peaks on the intensity histogram of the data set as $\c=\{c_{i}\}$ ($i=\overline{1,3}$), we arrive at the following energy function:
\begin{equation*}
\mathcal{J}^{loc}_{\x}(U,\a,\b,\c,B) \triangleq {\sum\limits_{t=1}^{N}\sum\limits_{i=1}^{3}}\int\limits_{\mathcal{O}_{\x}}{u_{t;i}^q(\y)K(\x-\y)\left|\mystrut(5)I_t(\y)-a_t-b_t c_{i} B(\x)\right|^2d\y}.
\label{eqn:eqn2}
\end{equation*}
Here $q$ is a real-valued weighting exponent (we used $q=2$ in all our experiments); $U=\{u_{t;i}(\x)\}$ is the class membership function defining the probability of each particular pixel belonging to the corresponding intensity class; and $K(\x)$ is the truncated Gaussian kernel defined on the neighbourhood $\mathcal{O}_{\x}$. For the strict definition of these parameters and a more detailed explanation of them, we refer the reader to the original publication by \citet{bib:Li2009}.


Minimizing $\mathcal{J}^{loc}$ with respect to the variables $\a$, $\b$ and $B$ results in the following equations for the corresponding parameters:
%\begin{equation}
\begin{align}
a_{t=\overline{1,N}}&=\frac{\displaystyle \sum\limits_{i=1}^{3}\int_{\mathcal{O}_{\x}}{u^q_{t;i}(\x)I_t(\x)\left[\mystrut(5)c_i(K*B(\x))-b_tI_t(\x)\right]d\x}}
{\displaystyle \sum\limits_{i=1}^{3}\int_{\mathcal{O}_{\x}}{u^q_{t;i}(\x)d\x}},\nonumber\\
\displaystyle b_{t=\overline{1,N}}&= \frac{\displaystyle \sum\limits_{i=1}^{3}\int_{\mathcal{O}_{\x}}{u^q_{t;i}(\x)I_t(\x)\left[\mystrut(5)c_i(K*B(\x))-a_t\right]d\x}}
{\displaystyle \sum\limits_{i=1}^{3}\int_{\mathcal{O}_{\x}}{u^q_{t;i}(\x)I_t^2(\x)d\x}},\label{eqn:eqn3}\\
B &= \frac{K*\left(\displaystyle \sum\limits_{t=1}^{N}\sum\limits_{i=1}^{3}\int_{\mathcal{O}_{\x}}{u^q_{t;i}(\x)c_i\left[\mystrut(5)a_t + b_tI_t(\x)\right]d\x}\right)}
{K*\left(\displaystyle \sum\limits_{t=1}^{N}\sum\limits_{i=1}^{3}\int_{\mathcal{O}_{\x}}{u^q_{t;i}(\x)c_i^2d\x}\right)}.\nonumber
\end{align}
%\end{equation}
Note that, for simplicity, we kept the values of the target intensity of each class (assuming normalized data) fixed and equal to $\c=[0,\overline{\{I_t(\x)\}},1]$. Model (\ref{eqn:eqn3}) was solved iteratively for 10 iterations, which was enough to ensure convergence. To analyze the influence of the smoothness of the bias field, we performed another experiment by progressively reducing the standard deviation of the truncated Gaussian kernel \cite{bib:Li2009} from 2048 to \nicefrac{1}{2} of this value, \nicefrac{1}{3}, \nicefrac{1}{4}, and so on. The results (not shown) indicated that, once the standard deviation was reduced to 2048/3, further reduction effectively did not change the outcome. Based on this observation, this value was selected for all further experiments.

The estimated bias fields for the five data sets exhibit a high degree of resemblance, as illustrated in Figure~\ref{fig:estim_bias}. We also considered simpler versions of the derived model (\ref{eqn:eqn3}) by setting one or two out of the three intensity correction factors ($\a$, $\b$, $B$) to their default values. The results of this experiment, illustrated in Figure~\ref{fig:ResultsInter} for the Jeffrey divergence, indicate that bias correction alone significantly outperforms all other approaches and improves both the homogeneity of separate frames and the similarity between neighbouring ones.
\begin{figure*}[!t]
\includegraphics[width=\textwidth]{Figure2}
\centering
\caption{\small Estimated intensity inhomogeneity (bias) field for each of the five data sets. }
\label{fig:estim_bias}
\end{figure*}

\begin{figure*}[!t]
\includegraphics[width=\textwidth]{Figure3}
\centering
\caption{\small Results of intra-set intensity correction by different methods measured in terms of the Jeffrey divergence in the overlap regions. The value axis is presented in the logarithmic scale. Lower values indicate better performance. Here ``B'' is bias correction, and ``S'' and ``T'' denote modelled intensity scaling (slope) and shift (intercept), respectively. Whiskers of the boxplot indicate the maximum and the minimum value, respectively.}
\label{fig:ResultsInter}
\end{figure*}

\subsection{Intra-set intensity standardization}
\label{ssec:intra-stitch-scaling}

In this experiment, we analyzed the influence of the described bias correction and of intensity scaling by simple histogram equalization \cite{bib:KimP08} on training capability using a single data set. As the training data, we selected the 2929Q1 data set, which has the largest amount of annotated data, both in terms of the number of frames and the number of mitochondria. The Jaccard index on the test data was used as the quality measure. Each training experiment was repeated twelve times and the distributions of the calculated results are shown in Figures~\ref{fig:onestitch} and \ref{fig:example}.
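
For reference, the two quantities used in this experiment can be written as follows; the symbols $S$, $G$, $F$ and $T_{\mathrm{H}}$ are introduced here for illustration only. For a binary segmentation $S$ and the corresponding ground-truth annotation $G$, the Jaccard index is
\begin{equation*}
\mathrm{Jaccard}(S,G) = \frac{|S\cap G|}{|S\cup G|},
\end{equation*}
which equals 1 for a perfect match and 0 when the two regions do not overlap. Histogram equalization, in its basic global form and for intensities normalized to $[0,1]$, maps every intensity value $v$ through the cumulative distribution function $F$ of the image intensities,
\begin{equation*}
T_{\mathrm{H}}(v) = F(v),
\end{equation*}
so that the output histogram becomes approximately uniform. The variant described in the cited work may differ in detail from this basic form.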

\begin{figure*}[!t]
\includegraphics[width=\textwidth]{Figure4}
\centering
\caption{\small Distribution of the average Jaccard index on the test set for the network trained on a single data set (2929Q1; left) and on all five data sets combined (right). The network was trained twelve times on the data preprocessed by each method, every time with a different randomization seed for data splitting. Here ``B'' denotes bias correction, ``H'' histogram equalization, and ``M'' histogram mapping via the exact histogram specification approach. Whiskers of the boxplot indicate the maximum and the minimum value, respectively.}
\label{fig:onestitch}
\end{figure*}

\begin{figure*}[!t]
\includegraphics[width=\textwidth]{Figure5}
\centering
\caption{\small Representative example of the training results on the 2922Q1 data set with significant performance improvement resulting from the applied bias correction. The network was trained on the images preprocessed by different methods. In the bottom row, the ground truth and the actual segmentation are shown in complementary colors: blue and orange, respectively, such that the regions where they overlap appear white. The numbers indicate the corresponding value of the Jaccard index. The length of the scale bar is 1~\textmu{m}.}
\label{fig:example}
\end{figure*}

\subsection{Inter-set intensity standardization}
\label{ssec:inter-stitch-scaling}

Finally, we repeated the previous experiment on all five data sets combined. In~addition to the previously described bias correction and histogram equalization, we also applied an exact histogram specification technique \cite{bib:Coltuc2006} to map the histogram of each data set to the corresponding histogram of 2929Q1. The intensity transformation curve was calculated for the entire set of frames belonging to the corresponding data set, and, subsequently, the intensity of each particular image was modified using the calculated mapping. Results of this experiment are shown in Figures~\ref{fig:onestitch} and \ref{fig:results}.
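
As a simplified sketch of such a mapping, classical histogram matching transforms every intensity value $v$ of a source data set $s$ as
\begin{equation*}
T_s(v) = F_{\mathrm{ref}}^{-1}\left(\mystrut(5)F_s(v)\right),
\end{equation*}
where $F_s$ and $F_{\mathrm{ref}}$ denote the cumulative intensity distribution functions of the source data set and of the reference data set (here 2929Q1), respectively, and $F_{\mathrm{ref}}^{-1}$ is the generalized inverse of $F_{\mathrm{ref}}$. These symbols are introduced here for illustration only. Exact histogram specification refines this idea by first inducing a strict ordering among the pixels, so that the target histogram can be matched exactly; that refinement is not shown in this sketch.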

\begin{figure*}[!t]
\includegraphics[width=\textwidth]{Figure6}
\centering
\caption{\small Representative examples of the training results on all five data sets combined. The~network was trained on the images preprocessed by the ``B+M'' method: bias correction with subsequent mapping using the exact histogram specification method. Contrast of the images (top row) was increased for visualization purposes. In the bottom row, the ground truth and the actual segmentation are shown in complementary colors: blue and orange, respectively, such that the regions where they overlap appear white. The numbers indicate the corresponding value of the Jaccard index. The length of the scale bar is~1~\textmu{m}.}
\label{fig:results}
\end{figure*}

\section{Discussions and Conclusions}

The problem of standardizing the data is of paramount importance for successful training of a neural network on data originating from different sources and is generally referred to as \textsl{domain adaptation}. Together with the development of more robust training algorithms, it should enable successful application of deep learning approaches to large data arrays exhibiting a high degree of diversity. Here we developed and analyzed methods for correcting and standardizing image intensity on the level of a single frame, a single data set and multiple data sets. We demonstrated, in particular, that the developed bias correction approach (see Section~\ref{ssec:intra-stitch}) has a positive effect on training results; see Figure~\ref{fig:onestitch}(left) in Section~\ref{ssec:intra-stitch-scaling}. In this approach, the bias field is modelled as an imperfection of the acquisition hardware that results in uneven illumination and is derived from the data under the sole assumption that it is smooth. Quite surprisingly, a simple bias correction model, without additional intensity modification, outperformed all other approaches by far, as shown in Figure~\ref{fig:ResultsInter}. An~illustrative example of the benefits of performing the proposed bias correction is given in Figure~\ref{fig:example}.

Next, we considered the problem of standardizing image intensity for training a neural network on a combined data set consisting of multiple data sets; see Section~\ref{ssec:inter-stitch-scaling}. It is important to note that accounting for the difference in image intensity between training and test data sets can also be approached by augmenting the training data. In this work, we did not use this possibility as our goal was to investigate the impact of preprocessing methods on the training performance. The results of this experiment are summarized in Figure~\ref{fig:onestitch}(right). Several important conclusions can be drawn from this figure. First, training performance on the aggregate data set, consisting of all five data sets combined, is considerably lower than that on a single data set with a sufficient amount of annotated data. This performance decrease is explained by high data variability, and minimizing this effect is the main goal of this research. Second, none of the analyzed methods resulted in clearly superior performance. Third, exact histogram specification mapping \cite{bib:Coltuc2006} produces the best overall results, with the results of applying this method to raw and bias-corrected data being very similar. Fourth, applying the histogram equalization technique \cite{bib:Xiao2018}, commonly used for this purpose, clearly deteriorates the results.

Figure~\ref{fig:results} illustrates typical segmentation results on each of the five data sets on which our network was trained. Although the overall segmentation performance is quite good, this figure confirms that the difference in the amount of available training data per data set results in different performance; compare, e.g., 2929Q1 with 2922Q1 and/or 2929Q4. The presence of orange-colored regions in some of the images in the bottom row also reflects inconsistencies in our ground-truth annotation, as these objects were not annotated. As the next step in this project, we are planning to perform quality control on our manual annotations, which should further improve the segmentation results.

\section*{Acknowledgments}

We thank our colleagues Ian Alwayn, Asel Arykbaeva and Dorottya de Vries from the Department of Surgery (LUMC) for providing the tissue samples.

\bibliography{dzyubachyk21}

\end{document}

