\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{hyperref}
\usepackage{xpatch}

\xpatchbibmacro{date+extradate}{%
  \printtext[parens]%
}{%
  \setunit{\addperiod\space}%
  \printtext%
}{}{}

%\usepackage{mwe} % to get dummy images
\jmlrvolume{-- Under Review}
\jmlryear{2021}
\jmlrworkshop{Full Paper -- MIDL 2021 submission}
\editors{Under Review for MIDL 2021}

\title[Feature-based image registration]{Feature-based image registration in structured light endoscopy}


% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Andreas M. Kist\nametag{$^{1,2,*}$}}\footnotetext[1]{These authors contributed equally.} \Email{andreas.kist@fau.de}\\
\addr $^{1}$Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-University Erlangen-Nürnberg, Germany 
\AND
\addr $^{2}$Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head- and Neck surgery, University Hospital Erlangen, Friedrich-Alexander-University Erlangen-Nürnberg, Germany
\AND
\Name{Julian Zilker\midlotherjointauthor\nametag{$^{2,*}$}} \Email{julian.zilker@gmx.de}\\
\Name{Michael Döllinger\nametag{$^{2}$}} \Email{michael.doellinger@uk-erlangen.de}\\
\Name{Marion Semmler\nametag{$^{2}$}} \Email{marion.semmler@uk-erlangen.de}}


\begin{document}


%\footnotetext[3]{Contributed equally as senior authors}

\maketitle

\begin{abstract}
Images offer a two-dimensional (2D) representation of a three-dimensional (3D) environment. However, in many biomedical tasks, a 3D view is crucial for diagnosis. Projecting structured light, such as a regular laser grid, onto the surface of interest allows its 3D structure to be reconstructed. For reconstruction, it is crucial to correctly identify each laser ray and assign it to its respective position in the laser grid. Current methods for this task are semi-automatic and still rely heavily on manual annotation. Hence, a fully automatic, reliable method is desired. Here, we show that this assignment can be approached as an image registration problem. After separating the laser rays from the background, we found that registering the extracted laser rays directly to the fixed laser grid image fails when using state-of-the-art intensity-based image registration techniques, such as the Advanced Normalization Tools (ANTs). Using a custom feature-based loss, we train a U-Net-like deep neural network to compute deformation fields that register the laser rays onto the fixed image, followed by a custom post-processing assignment step. Using synthetic data, we show that the network is in general able to learn affine and non-linear transformations. Our method is also robust to missing or occluded rays. Using an \emph{ex vivo} dataset, we achieved an assignment accuracy of 91\%. In summary, we provide a new platform to perform feature-based registration and showcase it on a biomedical dataset. In the future, we will evaluate different architectural designs and more complex datasets.
\end{abstract}

\begin{keywords}
Image registration, deformation field, structured light, endoscopy
\end{keywords}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% INTRODUCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

\begin{figure}[htb]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:workflow}
  {\caption{Structured light endoscopy. (a) An endoscope consisting of a camera and a laser projection unit. Laser rays produce a point grid that is directed onto the surface of interest and visible in the camera image. (b) The extracted laser rays have to be assigned to a unique grid position in the planar laser grid for valid 3D reconstruction. $n_c$ and $n_r$ describe the number of columns and rows, respectively. (c) 3D reconstruction: the depth $z$ is inferred from the displacement in $x$ (and $y$, not shown for clarity) relative to the reference position, given a fixed camera-to-laser angle $\alpha$. }}
  {\includegraphics[width=\linewidth]{Figures/workflow.pdf}}
\end{figure}

A three-dimensional (3D) view is highly important for accurate biomedical diagnosis and treatment in many areas, for example MRI for strokes and CT for bone fractures. Nevertheless, several disciplines still rely on imaging procedures that use conventional cameras, such as laryngeal endoscopy, where only a 2D view onto the larynx is available \citep{andrade2020laryngeal, deliyski2010state}. At the same time, laryngeal endoscopy is the gold standard for assessing the health state of a subject's voice \citep{mehta2012current}. A healthy voice features a symmetric, homogeneous movement of the vocal folds in three dimensions \citep{titze1998principles}; with conventional endoscopy, however, the examiner has only limited access to this information. Therefore, there is a drive to develop 3D endoscopic techniques \citep{luegmair2010optical,ghasemzadeh2020method,semmler20163d, schmalz2012endoscopic}.

Recently, we and others showed that laryngeal endoscopy using structured light \citep{geng_structured_light} is feasible for 3D reconstruction of the vocal folds \citep{ghasemzadeh2020method,semmler20163d, luegmair2010optical, luegmair2015three}. By projecting laser rays at a fixed angle $\alpha$ relative to the endoscope (\figureref{fig:workflow}(a,c)), we can use triangulation for each laser ray to reconstruct the 3D surface. Briefly, in a calibrated laser grid, the depth is computed from the distance of each laser ray to its reference position (\figureref{fig:workflow}(c)). The key issue is the extraction of the laser rays and their correct individual assignment, which remains very challenging and requires a significant amount of manual effort \citep{semmler2017endoscopic}. Only recently have first reports appeared that use deep learning in structured light endoscopy \citep{li2019depth,ma2019structured}.
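
As an illustration of the triangulation principle, consider a simplified geometry that is assumed here for exposition only: the camera looks along the $z$ axis without perspective effects, and a laser ray is inclined by the angle $\alpha$ with respect to this axis. If the surface is displaced by $z$ from the calibrated reference plane, the laser spot shifts laterally by $\Delta x$ (a symbol introduced for this sketch) with
\begin{equation*}
    \Delta x = z \tan\alpha \quad \Longleftrightarrow \quad z = \frac{\Delta x}{\tan\alpha}.
\end{equation*}
This relation ignores perspective projection and lens distortion, which a calibrated system has to account for. It shows, however, why the assignment matters: $\Delta x$ is measured against the reference position of the same laser ray, so a ray assigned to the wrong grid position yields a wrong depth.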

In general, we hypothesized that this task can be approached as an image registration procedure \citep{hill2001medical} between a fixed image (the ideal grid) and a moving image, i.e. the laser rays extracted from an endoscopic image (\figureref{fig:workflow}(b)). Multiple registration platforms exist, such as the Computational Morphometry Toolkit (CMTK, \citealt{rohlfing2011user}) and the Advanced Normalization Tools (ANTs, \citealt{avants2009advanced}), to register biomedical image data, for example individual MRI brain scans to a brain atlas. Deformable medical image registration is a complex, fast-moving field with many strategies and applications \citep{sotiras2013deformable}, for example the use of phantoms to align different image modalities \citep{rodriguez2008pet}. Recently, many deep neural networks have been used for (biomedical) image registration \citep{fan2019birnet,jaderberg2015spatial,yang2017quicksilver,krebs2019learning}, as summarized by \citet{haskins2020deep}. Recent advances predict deformation fields using U-Net-like architectures \citep{Balakrishnan_2018_CVPR}, which can be combined with generative adversarial networks \citep{mahapatra2018deformable}. Most of these architectures, however, were evaluated on tomographic images that contain high-level, non-repetitive structure. Further, these methods typically align images by minimizing an intensity-based metric. Although point-registration deep neural networks have recently been proposed, they were only evaluated on point clouds that contain special features \citep{aoki2019pointnetlk} or they find only rigid transformations \citep{wang2019deep}. In contrast, laser grids are highly regular and repetitive, are potentially hard to align based on intensity alone, and likely undergo non-rigid transformations.

In this work, we investigate whether laser grid maps can be registered onto an ideal grid using well-established tools (ANTs) and deep neural networks. We test whether intensity-based registration can accurately map individual laser rays or whether a feature-based approach is more suitable. To this end, we develop a custom loss function and test the registration ability on a synthetic dataset. Finally, we show the general applicability and performance on \emph{ex vivo} data of oscillating vocal folds.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% METHODS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Methods}

We provide the code for this study on GitHub at \url{https://github.com/julzil/endolas}.

\subsection{Synthetic dataset}
\label{sec:synth}
We generated an evenly spaced laser grid containing 25 keypoints organized in five columns and five rows (5$\times$5). For initial experiments, which tested what the network is capable of learning and explored optimization strategies (as shown in \figureref{fig:synthetic_results}), the image size was 224$\times$224 px. We generated three sets, each containing 4800 synthetic images plus one fixed image (no perturbations). To each set, we randomly applied a different set of perturbations (as shown in \figureref{fig:synthdata}): the first set used affine transformations only (i.e. translation, rotation, shear, and scaling), the second one used affine and non-linear transformations (i.e. applying a sine), and the third one additionally included random keypoint dropout with a probability of 0.2 to mimic hidden, occluded or not extracted keypoints. Each set was randomly split into 70\% training, 15\% validation and 15\% test data. For \emph{ex vivo} experiments, we adjusted the synthetic dataset accordingly: we changed the resolution to 768$\times$768 px and used 324 keypoints arranged in an 18$\times$18 grid.
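
For orientation, the stated split corresponds to the following sizes per synthetic set, assuming that only the 4800 perturbed images are split and the fixed image is kept aside:
\begin{align*}
    N_\mathrm{train} &= 0.70 \cdot 4800 = 3360, \\
    N_\mathrm{val} &= 0.15 \cdot 4800 = 720, \\
    N_\mathrm{test} &= 0.15 \cdot 4800 = 720.
\end{align*}
If, as assumed here, the dropout is applied independently to each keypoint with probability $p = 0.2$, the expected number of visible keypoints per image is $(1-p)\,n$, i.e. $0.8 \cdot 25 = 20$ for the 5$\times$5 grid and $0.8 \cdot 324 \approx 259$ for the 18$\times$18 grid.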

\subsection{\emph{Ex vivo} dataset}

Calf larynges were obtained from the local slaughterhouse and prepared as previously described (reviewed in \citealt{dollinger2011experiments}). All footage was recorded using a Photron high-speed camera at 4,000 fps and a resolution of 768$\times$768 px. A 532 nm Nd:YAG laser was used for laser grid generation. The collimated laser light was focused via an 18$\times$18 micro-lens array to yield a focal plane around 80 mm below the light outlet. The laser grid was calibrated using a custom calibration script in MATLAB \citep{semmler20163d}. We analyzed twelve videos in total, each recorded on a unique calf larynx. Of the twelve videos, eight were assigned to the training set, two to the validation set and two to the test set. Each video contained twenty fully annotated frames. Therefore, the training set contained 160 frames, and the validation and test sets contained 40 frames each. In each frame, all visible keypoints were annotated manually and assigned to a unique grid position (\figureref{fig:workflow}(b)) for evaluation. These annotations were used as ground truth.

\subsection{Image generation}

Keypoints and their $(x,y)$ locations were either known by construction (synthetic dataset, see Section~\ref{sec:synth}) or manually annotated (\emph{ex vivo} dataset). In a working environment, the latter would typically be retrieved automatically using semantic segmentation of the laser rays, e.g. using a U-Net, with subsequent 2D peak finding. For each keypoint, we drew a small circle of a given radius, centered on the keypoint coordinates, onto an initially black image with the same dimensions as the original image. We further added a Gaussian-blurred version of the image to incorporate low-level structure. This approach resulted in images as shown in \figureref{fig:workflow} (moving image).
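
The image generation can be written compactly as follows. This is a sketch under assumptions that are not specified above: a binary disc of radius $r$ per keypoint, an isotropic Gaussian kernel $G_\sigma$ with standard deviation $\sigma$, and an unweighted sum of both terms. With keypoint locations $\mathbf{p}^k$, $k = 1, \dots, n$,
\begin{align*}
    c(\mathbf{x}) &= \begin{cases} 1, & \min_k \lVert \mathbf{x} - \mathbf{p}^k \rVert_2 \le r, \\ 0, & \text{otherwise,} \end{cases} \\
    m(\mathbf{x}) &= c(\mathbf{x}) + (G_\sigma * c)(\mathbf{x}),
\end{align*}
where $*$ denotes 2D convolution. The blurred term spreads the keypoint signal into the neighbourhood of each disc, so that pixels close to, but not inside, a disc carry non-zero intensity.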

\subsection{Neural network architecture}
\label{sec:nn}
We trained a modified U-Net architecture \citep{ronneberger2015u} with $k=32$ base filters (implementation from \citealt{gomez2020bagls}) in TensorFlow/Keras (v2.2.0 and v2.3.0, respectively). An overview is given in \figureref{fig:unet}. Each block contained a Conv2D layer with a 3$\times$3 kernel, followed by a batch normalization layer and a ReLU activation (similar to \citealt{cciccek20163d}). The final layer was a 1$\times$1 Conv2D layer with two filters and linear activation, resulting in two maps $u_x(\mathbf{x})$ and $u_y(\mathbf{x})$ for the displacement in $x$ and $y$, respectively. As input we used either only the moving image, the fixed and the moving image, or the moving image together with the gradient and the difference image, as suggested by \citet{fan2019birnet} (see also \tableref{tab:synthetic}). The input image size was 224$\times$224 px and 768$\times$768 px for the synthetic and the \emph{ex vivo} dataset, respectively. The fixed image contained an evenly spaced 5$\times$5 (synthetic) or 18$\times$18 (\emph{ex vivo}) grid (see example in \figureref{fig:workflow}(b)).
For the \emph{ex vivo} dataset, we applied several augmentations in each epoch in some experiments using the \emph{albumentations} package \citep{info11020125}: images were randomly varied in brightness, contrast, gamma and blur, and were additionally flipped and rotated.
In all experiments, we trained the architecture for 100 epochs using the Adam optimizer with a constant learning rate of $10^{-3}$. Training and inference were performed on a machine equipped with an NVIDIA RTX 2080 Ti (11 GB) graphics card.
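
In terms of shapes, the network can be viewed as a mapping from a stack of input images to the two displacement maps. As a sketch, assuming single-channel images that are concatenated along the channel axis and an output at input resolution (the symbols $g_\theta$, $H$, $W$ and $c$ are introduced here for illustration):
\begin{equation*}
    g_\theta : \mathbb{R}^{H \times W \times c} \rightarrow \mathbb{R}^{H \times W \times 2}, \qquad (u_x, u_y) = g_\theta(\text{input}),
\end{equation*}
with $c = 1$ for the moving image only, $c = 2$ for the moving and fixed image, and $c = 3$ for the moving, difference and gradient image. Here, $H = W = 224$ for the synthetic and $H = W = 768$ for the \emph{ex vivo} data.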

\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:unet}
  {\caption{Example neural network architecture for feature-based image registration predicting displacement maps $u_x$ and $u_y$ using a fixed $f(\mathbf{x})$ image and a moving $m(\mathbf{x})$ image (configuration I2 from \tableref{tab:synthetic}).}}
  {\includegraphics[width=\linewidth]{Figures/unet_registration.pdf}}
\end{figure}

\subsection{Displacement maps and feature-based loss function}

For each pixel $k$, warped coordinates $x_w^k$ and $y_w^k$ are computed as a function of the displacement maps $u_x$ and $u_y$ and the coordinates of the input image $x_m^k$ and $y_m^k$:

\begin{equation}
    x_w^k = x_m^k + u_x(\mathbf{x}^k),
\end{equation}
\begin{equation}
    y_w^k = y_m^k + u_y(\mathbf{x}^k).
\end{equation}
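
As a numerical illustration with hypothetical values, a keypoint located at $(x_m^k, y_m^k) = (100, 52)$ px whose displacement maps evaluate to $u_x(\mathbf{x}^k) = -4.0$ px and $u_y(\mathbf{x}^k) = 3.5$ px is warped to
\begin{equation*}
    (x_w^k, y_w^k) = (100 - 4.0,\; 52 + 3.5) = (96.0,\; 55.5)~\text{px}.
\end{equation*}
The displacement is thus read out at the position of the keypoint in the moving image and added to that position.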

As no ground-truth deformation maps are available, we use a feature-based metric that emphasizes the correspondence of warped and fixed laser rays (keypoints). Because the grid location $(x_f^k, y_f^k)$ of each keypoint is known, we can minimize the distances between warped and fixed keypoints directly. We first compute the Euclidean distance $d^k$ for each keypoint $k$ (\equationref{eq:distance}, in px), then the mean squared Euclidean distance (MSED) over the $n$ keypoints of each image (\equationref{eq:msed}, in px$^2$), and finally average over each batch of $N$ images (\equationref{eq:batchmsed}).

\begin{equation}
    \label{eq:distance}
    d^k = \sqrt{(x_w^k - x_f^k)^2 + (y_w^k - y_f^k)^2}
\end{equation}

\begin{equation}
    \label{eq:msed}
    \mathrm{MSED} = \frac{1}{n} \sum_{k=1}^{n} (d^k)^2
\end{equation}

\begin{equation}
    \label{eq:batchmsed}
    \epsilon_{\mathrm{MSED}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{MSED}_i
\end{equation}

Taken together, we minimize the MSED across the batch to train the network to predict highly accurate, feature-based displacement maps.
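
To make the loss concrete, consider a hypothetical image with $n = 2$ keypoints whose residuals after warping are $(x_w^1 - x_f^1,\, y_w^1 - y_f^1) = (3, 4)$ px and $(x_w^2 - x_f^2,\, y_w^2 - y_f^2) = (0, 2)$ px. Then
\begin{equation*}
    d^1 = \sqrt{3^2 + 4^2} = 5~\text{px}, \qquad d^2 = \sqrt{0^2 + 2^2} = 2~\text{px}, \qquad \mathrm{MSED} = \tfrac{1}{2}\left(5^2 + 2^2\right) = 14.5~\text{px}^2 .
\end{equation*}
Differentiating \equationref{eq:msed} with respect to the displacement values at keypoint $k$ gives
\begin{equation*}
    \frac{\partial\, \mathrm{MSED}}{\partial u_x(\mathbf{x}^k)} = \frac{2}{n}\,(x_w^k - x_f^k), \qquad \frac{\partial\, \mathrm{MSED}}{\partial u_y(\mathbf{x}^k)} = \frac{2}{n}\,(y_w^k - y_f^k).
\end{equation*}
Assuming that the displacement maps are read out exactly at the keypoint pixels, the loss therefore constrains the maps directly only at keypoint locations; the values in between are shaped indirectly through the shared network weights.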

\subsection{Grid classification and evaluation metric}
\label{seq:eval}
After the image registration, the points were assigned a grid position using a heuristic similar to a nearest neighbour search. As our task is bijective, i.e. each keypoint can be assigned to only one grid position and vice versa, we assign a keypoint to the grid position to which it is globally the closest. This iterative search results in a unique assignment map. We provide a structogram in the Appendix (\figureref{fig:nn}). As we observed that in some cases the points were not assigned in order, we used prior knowledge, e.g. that 3 follows 2 and that 2 does not follow 3, to implement a custom bubble sort algorithm that ensures that points are sorted logically in the grid.
For evaluation, we investigated how close the registered points are to their respective ideal grid locations using the mean Euclidean distance (MED) in px. We further determined the assignment accuracy, i.e. the fraction of points that were assigned correctly in the grid.
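
A possible formalization of the assignment step is given below. It assumes that ``globally the closest'' refers to the smallest distance among all remaining keypoint-position pairs; it is not a transcription of the structogram, and the sets $\mathcal{K}$ (unassigned warped keypoints $\mathbf{p}_w^k = (x_w^k, y_w^k)$) and $\mathcal{G}$ (free grid positions $\mathbf{q}^g$) are introduced here for illustration only. In each iteration, the closest pair is selected,
\begin{equation*}
    (k^\ast, g^\ast) = \operatorname*{arg\,min}_{k \in \mathcal{K},\; g \in \mathcal{G}} \lVert \mathbf{p}_w^k - \mathbf{q}^g \rVert_2 ,
\end{equation*}
keypoint $k^\ast$ is assigned to position $g^\ast$, both are removed from $\mathcal{K}$ and $\mathcal{G}$, and the step is repeated until $\mathcal{K}$ is empty. Denoting the assigned position by $\hat{g}(k)$ and the annotated position by $g(k)$, the assignment accuracy over $n$ keypoints is
\begin{equation*}
    \mathrm{accuracy} = \frac{\lvert \{ k : \hat{g}(k) = g(k) \} \rvert}{n} .
\end{equation*}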


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% RESULTS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experiments and Results}

Using the generated images, we first investigated whether intensity-based algorithms, such as ANTs, were able to closely register any frame to the planar laser grid image. We found that the results were insufficient (see Appendix, \figureref{fig:ants}, for an example image). We therefore asked whether a deep neural network-based approach is suitable for feature-based image registration.

\subsection{An encoder-decoder network is able to learn feature registration using displacement maps}

We next investigated which transformations could be learned using a synthetic dataset of 25 keypoints arranged in five rows and five columns (see also Section~\ref{sec:synth}). The network should be especially robust to highly non-linear, i.e. non-rigid, transformations and to missing keypoints, as both are common in endoscopic footage. We therefore tested a series of transformations (\tableref{tab:synthetic}, T1--T3).

\begin{table}[htbp]
    \footnotesize
    \centering
    \begin{tabular}{|l|c|c|c|c|}
        Id & Deformation &  Input Images & \multicolumn{2}{c|}{Figure}  \\
        \hline
        \hline
        T1 & \textbf{Affine} & Moving & \ref{fig:synthetic_results}(a) & \\
        T2 & \textbf{Affine} + \textbf{Non-linear} & Moving & \ref{fig:synthetic_results}(a) & \\
        T3 & \textbf{Affine} + \textbf{Non-linear} + \textbf{Dropout} & \textbf{Moving} & \ref{fig:synthetic_results}(a) & \ref{fig:synthetic_results}(b,c) \\
        I2 & Affine + Non-linear + Dropout & \textbf{Moving + Fixed} & & \ref{fig:synthetic_results}(b,c) \\
        I3 & Affine + Non-linear + Dropout & \textbf{Moving + Difference + Gradient} & & \ref{fig:synthetic_results}(b,c) \\
    \end{tabular}
    \caption{Overview of the training settings used with the synthetic dataset.}
    \label{tab:synthetic}
\end{table}

As shown in \figureref{fig:synthetic_results}(a), all tested transformations can be learned by our network architecture. After roughly 50 epochs, we observe a steady state in which the loss decreases only marginally on the training and validation dataset. As the fixed image is constant across experiments and individual images, we found that feeding only the moving image is sufficient for network convergence. Intrigued by previous studies \citep{fan2019birnet,jaderberg2015spatial}, we tested different input strategies to improve the network performance. We found that feeding more information is in general beneficial for network performance (\figureref{fig:synthetic_results}(b)). The median of the mean Euclidean distance (MED) of the keypoints in the laser grid was 1.41 px, 0.46 px and 0.68 px (T3, I2 and I3, respectively); condition I2 thus yielded the best results and the narrowest distribution (\figureref{fig:synthetic_results}(c)). We also compared MSED and MED as loss functions and found equal convergence behavior. Regularization ($L_1$ or $L_2$) impaired the network convergence and yielded worse results. In summary, the assignment accuracy was around 97--99\% across multiple settings. In the Appendix (\figureref{fig:dist_synth}), we provide an overview of the registration accuracy across images.

\begin{figure}[htb]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:synthetic_results}
  {\caption{Feature-based registration on a synthetic dataset. (a) MSED loss of training (solid line) and validation (dashed line) set across training epochs for different transformation settings (T1-T3). (b) MSED loss of training (solid line) and validation (dashed line) set across training epochs for different inputs (T3, I2, I3). Same y-axis as in panel (a). (c) Distribution of mean Euclidean distances (MEDs) for different inputs on the validation set (T3, I2, I3).  }}
  {\includegraphics[width=\linewidth]{Figures/synthetic_transformations_input.pdf}}
\end{figure}

\subsection{Feature-based registration performs well on real, highly non-linear deformations}

We next investigated whether the network performs as well as on the synthetic dataset when confronted with larger images, more keypoints and real recording conditions. We used \emph{ex vivo} recordings of calf larynges onto which an 18$\times$18 laser grid was projected, and extracted the keypoints manually (see Methods and \figureref{fig:workflow}(b)). We applied the same architecture as described for the synthetic dataset and trained for 100 epochs (\figureref{fig:exvivo_results}(a)). Training only on the ground-truth data resulted in a high median MED and a low assignment accuracy (see Section~\ref{seq:eval}) of 51\% on the validation dataset (\figureref{fig:exvivo_results}(b,c)). We further evaluated the network's performance on the validation set when trained purely on matching synthetic data (same resolution and grid dimensions, see Section~\ref{sec:synth}). Here, we also obtained high median MEDs and a low accuracy of 46\% (\figureref{fig:exvivo_results}(b,c)). Interestingly, combining the ground-truth and the synthetic data improved the performance, yielding a lower median MED of 9.8 px and an accuracy of 72\%. We found the largest boost when applying intense augmentations (see Section~\ref{sec:nn}) and increasing the variety 20-fold. In this case, the MED variance was very low (\figureref{fig:exvivo_results}(b)) and the MED was 5.13 px. As the grid spacing is 16 px, this suggests an accurate nearest neighbour sorting. Indeed, the accuracy is around 91\% on the validation set (\figureref{fig:exvivo_results}(c)). Notably, the highest uncertainty in prediction occurs where keypoints are missing close to the glottis (see \figureref{fig:workflow}(b) and Appendix, \figureref{fig:dist_exvivo}).
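
The link between the MED and the assignment can be made explicit with a simple bound, assuming that a keypoint is assigned to the nearest grid position and ignoring the uniqueness constraint and the subsequent sorting step. On a square grid with spacing $s$, a warped keypoint is closest to its correct grid position whenever its residual distance satisfies
\begin{equation*}
    d^k < \frac{s}{2} = \frac{16~\text{px}}{2} = 8~\text{px}.
\end{equation*}
The MED of 5.13 px lies below this threshold, whereas the median MED of 9.8 px obtained when combining ground-truth and synthetic data exceeds it, which is in line with the lower accuracy in that setting.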

\begin{figure}[htb]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:exvivo_results}
  {\caption{Feature-based registration on an \emph{ex vivo} dataset. (a) MSED loss of training (solid line) and validation (dashed line) set across training epochs for different input data. (b) Distribution of mean Euclidean distances (MEDs) on the validation set for different input data.
  (c) Mean accuracy on validation dataset. }}
  {\includegraphics[width=\linewidth]{Figures/exvivo.pdf}}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% DISCUSSION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discussion and Conclusion}

In this study, we suggest a U-Net-like architecture that uses a moving and a fixed image, showing laser points and a rectangular grid, respectively, to compute a deformation field that registers keypoints based on their identity. We show that the network is able to learn affine and highly non-linear transformations, and is capable of coping with a large fraction of missing keypoints. However, we have not yet systematically addressed how many missing keypoints our approach can tolerate. We found that larger gaps in the keypoint grid result in higher assignment variation; nevertheless, we still found high assignment accuracies of around 91\%, which is only slightly lower than on our toy dataset (97--99\%).

We also found that training solely on synthetic data is almost as good as training on only ground-truth data, and that a blend of synthetic and ground-truth data enhances the registration and thus the assignment accuracy (\figureref{fig:exvivo_results}). Still, we believe that the non-linear transformations in the \emph{ex vivo} data are not fully represented in our synthetic dataset. Further investigation of these non-linear transformations may help in developing better strategies to generate more realistic synthetic data.

The data used in this study were manually annotated to evaluate the core idea of using feature-based registration in structured light endoscopy. The extraction accuracy may also have an impact on the registration, as keypoints could be missed or additional keypoints could be identified at wrong locations, which would affect both the registration and the assignment. In the future, we will address the complete workflow to uncover sources of error propagation.

In summary, our results suggest that the presented feature-based registration method is highly valuable for structured light endoscopy, such as 3D laryngeal endoscopy, and, together with automatic keypoint extraction, potentially enables a fully automatic data analysis.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% CONCLUSIONS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\section{Conclusions}

%In this study we suggest a U-Net-like architecture that uses a moving and a fixed image of laser points and a rectangular grid, respectively, to compute a deformation field to register keypoints based on their identity. We show that the network is able to learn affine and highly non-linear transformations, and is capable of coping with a large fraction of missing keypoints. We used a synthetic and an \emph{ex vivo} dataset to show the general applicability. Our architecture achieved accuracies of 99\% and 91\% for these datasets, respectively. Our results suggest that this method is highly valuable in structured-light based endoscopy, such as 3D laryngeal endoscopy.



% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{Andreas M. Kist was supported by a fellowship of the Joachim Herz Foundation. Part of this work was supported by the German Research Foundation (DFG) under grant no.\ DFG DO1247/9-1.}


\bibliography{midl-samplebibliography}


\newpage
\appendix

\renewcommand\thefigure{\thesection.\arabic{figure}}    
\setcounter{figure}{0}    

\section{Intensity-based image registration}

We used ANTs with various settings to register the extracted keypoints to the fixed image (regular laser grid). However, none of the settings yielded satisfactory results (see \figureref{fig:ants}).

\begin{figure}[h]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:ants}
  {\caption{Failed registration using ANTs. (a) morphed image, (b) deformation field.}}
  {\includegraphics[width=\linewidth]{Figures/failed_registration.pdf}}
\end{figure}


\section{Synthetic data generation - example images}
\setcounter{figure}{0}    

\begin{figure}[htb]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:synthdata}
  {\caption{Synthetic dataset. Fixed image and examples of images with affine transformation (used in configuration T1),  with affine and non-linear transformation (T2), and with affine and non-linear transformations with random dropout (T3, I2, I3). }}
  {\includegraphics[width=\linewidth]{Figures/synth_data.pdf}}
\end{figure}


\newpage

\section{Bijective nearest neighbour search}
\setcounter{figure}{0}    

\begin{figure}[h]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:nn}
  {\caption{Structogram of bijective nearest neighbour search.}}
  {\includegraphics[width=\linewidth]{Figures/nearest_neighbor.pdf}}
\end{figure}

\newpage

\section{Registration accuracy in the synthetic test dataset}
\setcounter{figure}{0}    

\begin{figure}[h]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:dist_synth}
  {\caption{Registration accuracy in the synthetic test dataset. (a) MED distribution of validation (gray) and test (red) dataset. (b) Registration accuracy across images, color-coded for each keypoint. (c) Example registration of moving image (red dots) to fixed image (white grid). Warped keypoints in green. }}
  {\includegraphics[width=\linewidth]{Figures/7_test.pdf}}
\end{figure}

\newpage

\section{Registration accuracy in the \emph{ex vivo} test dataset}
\setcounter{figure}{0}    

\begin{figure}[h]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:dist_exvivo}
  {\caption{Registration accuracy in the \emph{ex vivo} test dataset. (a) Registration accuracy across images, color-coded for each keypoint. Note the uncertainty in the center, where the moving glottis is located and many keypoints are (at least partly) missing. (b) Example registration of moving image (red dots) to fixed image (white grid). Warped keypoints in green. }}
  {\includegraphics[width=\linewidth]{Figures/4-spatial.pdf}}
\end{figure}


\end{document}
