\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{footnote}
%\makesavenoteenv{tabular}
%\makesavenoteenv{table}
%\usepackage{mwe} % to get dummy images
%\usepackage{makecell}
%\usepackage{multirow}

\jmlryear{2020}
\jmlrworkshop{Full Paper -- MIDL 2020}

\newcommand{\etal}{{\em et al.}\,}
\newlength\mylen
\newcommand\myinput[1]{%
  \settowidth\mylen{\KwIn{}}%
  \setlength\hangindent{\mylen}%
  \hspace*{\mylen}#1\\}

\title[Foveated Pyramid Attention]{Locating Cephalometric X-Ray Landmarks with Foveated Pyramid Attention}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
\midlauthor{\Name{Logan Gilmour} \Email{legilmou@ualberta.ca} \and \Name{Nilanjan Ray} \Email{nray1@ualberta.ca}\\
\addr Department of Computing Science, University of Alberta}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}



\begin{document}

\maketitle

\begin{abstract}

CNNs, initially inspired by human vision, differ from it in a key way: they sample uniformly, rather than with highest density in a focal point. For very large images, this makes training untenable, as the memory and computation required for activation maps scale quadratically with the side length of an image. We propose an image-pyramid-based approach that extracts narrow glimpses of the input image and iteratively refines them to accomplish regression tasks. To assist with high-accuracy regression, we introduce a novel intermediate representation we call `spatialized features'. Our approach scales logarithmically with the side length, so it works with very large images. We apply our method to Cephalometric X-ray Landmark Detection and get state-of-the-art results.

\end{abstract}

\begin{keywords}
Deep learning, Landmark detection, Attention mechanism, Convolutional Neural Network, 2D X-ray cephalometric analysis, Image pyramid
\end{keywords}

\section{Introduction}
Convolutional Neural Networks (CNNs), though initially inspired by human vision \cite{fukushimaNeocognitronHierarchicalNeural1988, lecunBackpropagationAppliedHandwritten1989}, are different from human vision in an important way: human vision has its highest density in the center (the fovea) \cite{javiertraverReviewLogpolarImaging2010}, while conventional CNNs sample uniformly from their input. As a result, even a $512 \times 512$ input to a CNN is considered high-resolution \cite{tanEfficientNetRethinkingModel2019}, because scaling up the side length of an input image increases the computation and memory requirements quadratically. This distinction is especially relevant in spatial regression tasks, where downsampling input images will likely reduce accuracy.

The human approach to spatial regression is much more efficient. A person looking at an image on a computer screen is only processing a tiny fraction of the pixels at the highest retinal resolution (in the fovea), with sample density falling off exponentially away from the point of focus \cite{guenterFoveated3DGraphics2012}. Given that human-level performance is the benchmark for many computer vision tasks, non-uniform sampling looks promising.

To this end, we propose a new approach for landmark regression that makes the following contributions: (i) an image-pyramid-based method to adapt a pretrained CNN to perform non-uniform sampling centered on a focal point, producing many low-resolution feature maps, (ii) a dimensionality-reduction technique to convert the resulting feature maps into feature vectors with 2D spatial coordinates (we call these `spatialized features'), and (iii) an error-feedback approach that iteratively estimates a landmark location from the resulting features.

Our approach scales logarithmically with the side length of the input, so it works with very large images. We test our approach on a reasonably high-resolution dataset of Cephalometric X-ray Landmark Locations \cite{wangBenchmarkComparisonDental2016} and get state-of-the-art results.

\subsection{Cephalometric X-ray Landmark Regression}

Cephalometric landmarks are used in cephalometric analysis to provide angular and linear measurements of a patient's dental, bony, and soft tissue for orthodontic diagnosis and treatment planning \cite{duraoCephalometricLandmarkVariability2015}. A dataset of 400 head-and-neck X-rays labelled with 19 landmarks is publicly available \cite{wangBenchmarkComparisonDental2016} from an ISBI 2015 Grand Challenge. The two accepted entries in the challenge both used variations on random forest regression of Haar features \cite{ibragimov2015computerized, lindner2015fully}. The winners refined their approach in \cite{lindnerFullyAutomaticSystem2016a}, showing state-of-the-art results with four-fold cross validation on all images in the dataset.

Deep learning approaches have only recently become comparable. Two early convolutional approaches directly regressed landmark locations \cite{arikFullyAutomatedQuantitative2017, leeCephalometricLandmarkDetection2017}, and though promising, did not perform as well as \cite{lindnerFullyAutomaticSystem2016a}. An approach using object detector YOLOv3 \cite{redmonYOLOv3IncrementalImprovement2018} reports results closer to Lindner \etal using a private dataset that is approximately three times larger \cite{parkAutomatedIdentificationCephalometric2019}. Another detector-based approach \cite{qianCephaNetImprovedFaster2019} builds on Faster-RCNN \cite{renFasterRCNNRealTime2017} to achieve good results while not advancing the overall state of the art. Two recent methods transform the target into a heatmap prediction task using a fully-convolutional network \cite{payerIntegratingSpatialConfiguration2019, zhongAttentionGuidedDeepRegression2019}. Zhong \etal improves on the average error of Lindner \etal by 5\%.

Notably, Lindner \etal does  `coarse' and `fine' random forest regression vote maximization searches \cite{lindnerFullyAutomaticSystem2016a}, while Zhong \etal uses two U-nets \cite{ronnebergerUNetConvolutionalNetworks2015} to produce `coarse' and `fine' heatmap predictions of landmark locations. The fact that both of the best methods are `coarse-to-fine' suggests that our heavily multi-resolution approach should be well suited.

\subsection{Foveated Sampling}

A body of work based on foveated approaches to vision tasks exists, using approaches like the log-polar transform, the Cartesian foveated geometry, or the reciprocal wedge transform \cite{javiertraverReviewLogpolarImaging2010}. Unfortunately all of these distort space, meaning translation invariance is lost, which is the property convolutional networks are defined by. For especially radical transformations like the log-polar transform (which maps an image radially), we should not expect transfer learning to be effective, which is a major blow (especially for our problem, where the dataset is quite small). Work exists on using these types of transformations with CNNs \cite{jaramillo-avilaFoveatedImageProcessing2019,enlighten148802}, but early experiments we tried with them were not promising.

Instead, we use image pyramids, which often improve performance in convolutional approaches, but are typically only used at inference time (because of high memory costs), or are generated implicitly in-network \cite{linFeaturePyramidNetworks2017a}. We start from an approach taken in Recurrent Models of Visual Attention \cite{mnihRecurrentModelsVisual2014}, which constructs a retina-like representation by stacking together small patches (all of the same pixel dimensions) at different scales, so that only a small focal region of an image is processed at full resolution. However, they directly process a `glimpse' (we will use their terminology) with a recurrent network. We extend their `glimpse sensor' to use a pretrained CNN to process the patches in the glimpse, allowing for larger patch sizes and smaller training data. Also, they perform a classification task, using a reinforcement-learning approach to take the right glimpses to make a decision, whereas we simply learn to regress our glimpse towards the target.

\subsection{CNN Regression}

Object detection is a well-known application of CNN regression. Two recent approaches to Cephalometric X-ray Landmark Detection \cite{parkAutomatedIdentificationCephalometric2019, qianCephaNetImprovedFaster2019} are object-detection based. We are particularly interested in Trident Networks \cite{liScaleAwareTridentNetworks2019}. Their innovation is to process three different scales of an image with the same CNN (via dilated convolutions). We adopt a similar approach, processing different resolutions from our glimpse with the same CNN.

Human pose estimation is another area where CNNs are applied to regression. Though the seminal deep-learning approach directly regressed human joint coordinates using a CNN \cite{toshevDeepPoseHumanPose2014}, most recent work instead uses a fully convolutional architecture to learn a heat map centered on the joint, then finds the coordinates of the maximum value of the heatmap \cite{newellStackedHourglassNetworks2016,sunIntegralHumanPose2018, xiaoSimpleBaselinesHuman2018}. Heatmap regression is also used in the current best performing approach \cite{zhongAttentionGuidedDeepRegression2019} to Cephalometric X-ray Landmark Detection. However, taking the maximum means that this method is not end-to-end differentiable in the landmark coordinates. This provides some difficulty for our foveated model, where we need to combine information from many different resolutions to make a prediction.

Fortunately, Integral Human Pose Regression \cite{sunIntegralHumanPose2018} provides a solution: Instead of taking the maximum, they treat a heatmap as a probability distribution, taking the expected value of the coordinates by discrete integration. They refer to this as `integral regression'.  This is both differentiable and allows high accuracy with lower resolution heatmaps. We generalize their approach to generate what we call `spatialized features', treating activation maps as low resolution heat maps, then using integral regression to reduce their dimensionality while preserving the potential for accurate spatial information.

A key question for our proposed method is how to choose the focal point where all patches taken from the pyramid are centered. This too, we find in the pose estimation literature: Human Pose Estimation with Iterative Error Feedback \cite{carreiraHumanPoseEstimation2016} uses the older style of directly regressing coordinates with a CNN, but does so iteratively. In each iteration, they apply a Gaussian distribution heatmap centered on the prediction from the last iteration as part of the input to their CNN. However, their method processes the entire image each iteration. Our method only processes a glimpse of the image, with each iteration refining the focal point of the glimpse.

Though we use a simple iterative approach, in Evaluating Reinforcement Learning Agents for Anatomical Landmark Detection \cite{alansaryEvaluatingReinforcementLearning2019}, medical landmarks are found by a hierarchical coarse-to-fine search where the actions taken move a smaller region of interest processed by a reinforcement learning agent. This result, coupled with \cite{mnihRecurrentModelsVisual2014} (from whom we take our sampling approach), indicates that a reinforcement learning approach might be a promising future direction.

\section{Method}

\begin{figure}[t]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:arch}
  {\caption{\textbf{Left}: Starting from the mean landmark location, a $64 \times 64$ patch is sampled from each level, then the CNN assesses each scale individually (via several instances sharing parameters). The MLP takes the spatialized and concatenated outputs from the CNNs and regresses the predicted offset (the grey dotted arrow) to update the current estimate (the grey dot). The red dot is the ground truth. This process is repeated 10 times. \textbf{Right}: A visualization of what the network ``sees'': a stack of $64 \times 64$ images (a glimpse) centered on a focal point. Cropped for clarity.}}
  {\centerline{
    \includegraphics[width=.65\textwidth]{ArchitechtureOverview.pdf}
    \includegraphics[width=.4\textwidth]{pyramid_columns.png}}}
\end{figure}

We process one landmark at a time. Our method is based on only taking small `glimpses' of the input image. A glimpse $\mathbf{g}$ is an $N\times64\times64$ volume of $N$ patches, where each 64$\times$64 patch $g_i$ is taken from the image downsampled by a factor of $2^{i-1}$ (so $g_1$ comes from the full-resolution image). Each patch is centered on the current estimated landmark location $\vec{\hat{x}}$.

To construct a glimpse, we first build a Gaussian pyramid \cite[Section 3.5]{szeliskiComputerVisionAlgorithms2011} $\mathbf{I}$ with $N$ levels. The first level $I_1$ is full size (i.e., it is the original image), and $N$ is set so that the size of $I_N$ is approximately the size of a glimpse patch ($64 \times 64$). For the x-ray images (image size 2400$\times$1935), $N=6$. We then sample patches $g_1\ldots g_N$, each from its corresponding level in the pyramid, and each centered on $\vec{\hat{x}}$. These are stacked into a glimpse $\mathbf{g}$. For training, $\vec{\hat{x}}$ is initialized randomly from a normal distribution with mean and standard deviation calculated from the training labels for the landmark, while for inference it is initialized to the exact mean of the training labels for the landmark. Each patch is then processed by the CNN.
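
As a rough check of the level count, assume each pyramid level halves the side lengths of the previous one (the rounding shown here is an assumption for illustration, not necessarily what the implementation does). The level sizes for a $2400\times1935$ image are then approximately
\[
I_1\!: 2400\times1935,\;\; I_2\!: 1200\times968,\;\; I_3\!: 600\times484,\;\; I_4\!: 300\times242,\;\; I_5\!: 150\times121,\;\; I_6\!: 75\times60,
\]
which agrees with $1 + \log_2(2400/64) \approx 6.2$ and gives a coarsest level $I_6$ close to the $64\times64$ patch size for $N=6$.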

\subsection{The CNN}

For the convolutional part of our method, we used a pretrained 34-layer ResNet \cite{heDeepResidualLearning2016} trained on ImageNet (provided by PyTorch \cite{paszke2019pytorch}). We make several modifications. As the X-rays are grayscale, the first convolution is modified to take a single channel input, using the weights from the green channel. The stride of this first layer is decreased from 2 to 1, in order to preserve spatial resolution. We remove the final 3 basic blocks (a total of 6 layers) and the final fully connected layer. Because we remove two downsamples by truncating the network, and we decrease the stride of the first convolution, the CNN produces an activation volume $\mathbf{A}$ of size $256\times8\times8$.
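
The output size can be traced through the truncated network. This sketch assumes the standard ResNet-34 layout is otherwise unchanged, in particular that the max-pool after the first convolution is kept (the text above does not state this explicitly). For a $64\times64$ input patch, the spatial side length would then evolve as
\[
64 \xrightarrow{\text{conv1, stride 1}} 64 \xrightarrow{\text{max-pool}} 32 \xrightarrow{\text{stage 1}} 32 \xrightarrow{\text{stage 2}} 16 \xrightarrow{\text{stage 3}} 8,
\]
with 64, 128, and 256 channels after stages 1, 2, and 3 respectively. The three removed basic blocks make up the fourth stage of ResNet-34, which would otherwise halve the resolution again and widen the output to 512 channels.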

\subsection{Spatialized Features}

The 256 channel activation volume $\mathbf{A}$ (the output of the CNN applied to a single patch $g_i$ in the glimpse $\mathbf{g}$) can be thought of as 256 low resolution $8 \times 8$ heatmaps $A_1\ldots A_{256}$. We make the assumption that each channel ($8\times8$ heatmap $A_k$) encodes the spatial location of one point feature ($f_{kx},f_{ky}$ in equation \ref{eq:1}). With this assumption, we can use the landmark regression approach taken in \cite{sunIntegralHumanPose2018} to reduce each heatmap to a single point with an explicit spatial location. We derive 256 probability distributions by performing a softmax on each heatmap $A_k$ in $\mathbf{A}$, yielding distributions $p_1\ldots p_{256}$. Then, we take the expected value of the spatial location of each feature. This can also be seen as finding the center of mass \cite{tensmeyerRobustKeypointDetection2019a} for the given heatmap.

However, unlike \cite{sunIntegralHumanPose2018}, these point features are not the final output, but an intermediate representation, so it might be important that some point features be weighted more heavily than others (or weighted zero if they are not present in the given glimpse). In parallel with computing the location of the point feature, we also compute the expected value $f_{ka}$ of the raw activations in the heatmap. This results in the 3-vector $\vec{f}_k$: 
\begin{equation}
    \vec{f}_k = \begin{bmatrix}
           f_{kx} \\
           f_{ky} \\
           f_{ka}
         \end{bmatrix} = \sum_{y=1}^{H=8} \sum_{x=1}^{W=8} p_k(x,y)  \begin{bmatrix}
           (x-4.5)/4\\
           (y-4.5)/4\\
           A_k(x,y)
         \end{bmatrix} \label{eq:1}
\end{equation}

$f_{ka}$ can be seen as a `soft-max-pool' with a kernel of size 8$\times$8: it is a weighted average of the raw activations in the heatmap, weighted toward the maximum activations. This is exaggerated by softmax's emphasis on larger values. Note that the $(f_{kx},f_{ky})$ coordinate is normalized to lie in the range $[-1,1]$, with the origin in the center of the heatmap. The location of each pixel in the heatmap is taken as that pixel's center.
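
To make the role of $p_k$ in Equation~\ref{eq:1} concrete, it can be written out as a plain softmax over the 64 entries of the heatmap (we assume no temperature or other rescaling here):
\[
p_k(x,y) = \frac{\exp\bigl(A_k(x,y)\bigr)}{\sum_{y'=1}^{8}\sum_{x'=1}^{8}\exp\bigl(A_k(x',y')\bigr)}.
\]
Two limiting cases illustrate the behaviour. If $A_k$ is constant with value $c$, then $p_k(x,y)=1/64$ everywhere and $\vec{f}_k = (0,\,0,\,c)^\top$, since $\sum_{x=1}^{8}(x-4.5)=0$. If $A_k$ has a single entry at $(x^*,y^*)$ that is much larger than all others, then $\vec{f}_k$ approaches $\bigl((x^*-4.5)/4,\;(y^*-4.5)/4,\;A_k(x^*,y^*)\bigr)^\top$. Because pixel centers are used, the extreme values $x=1$ and $x=8$ map to $-0.875$ and $0.875$, so the coordinates always fall strictly inside $[-1,1]$.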

Transforming to spatialized features reduces the input $\mathbf{A}$ of size $256\times8\times8$ to an output $\mathbf{F}$ of size $256\times3$. We flatten $\mathbf{F}$ to a vector $\vec{s}_i$ of size 768. For a visualization of the intermediate heatmaps learned by this method, as well as a diagram providing intuition for the spatialization process, see Figure~\ref{fig:features}.

\begin{figure}[t]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:features}
  {\caption{\textbf{Left:} Examples of heatmaps (visualized as contours) learned as spatialized features, overlaid on the relevant patch from the glimpse. Each row visualizes the activation of a different heatmap $A_k$ in the same four x-rays. The red dot is the ground truth $\mathbf{\vec{x}}$ for the landmark this network was trained on (the sella). \textbf{Right:} A diagram showing the spatialization process. The heatmaps are fictional for illustrative purposes. $\mathbf{A}$ is an activation volume from the CNN. $\mathbf{F}$ is the corresponding spatialized features. Note that though we plot the illustrative spatialized features (dot with red arrow pointing to it) as roughly on the center of mass, in practice, the expected coordinates $(f_{kx},f_{ky})$ the network predicts, though pointing in the correct direction, end up at an arbitrary scale ``preferred'' by the network, so the raw values are not meaningful to plot.}}
  {\centerline{
    \includegraphics[width=.4\textwidth]{features_square.png}
    \includegraphics[width=.7\textwidth]{Spatialized.pdf}}
    }
\end{figure}

\subsection{The MLP}

To produce the input to the fully connected network (MLP), we take each of the $N$ 768-vectors $\vec{s}_1\ldots \vec{s}_N$ produced by the CNN and concatenate them into one flat vector $\mathfrak{s}$ of length $N\times768$. For our case of $N=6$, this yields a 4608-vector, which contains all spatialized features from all levels of the pyramid.

The MLP is a 3-layer network that has one hidden layer of width 512 with ReLU activation, then one hidden layer of width 128 with ReLU activation, and then a final linear layer of width 2 that regresses the estimated offset $\mathbf{\bar{x}}$. We initialize the layers of the MLP orthogonally as described in \cite{saxeExactSolutionsNonlinear2014}, which seems to induce faster and more repeatable convergence.
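
For a sense of scale, the size of this MLP can be estimated directly from the layer widths. Assuming every layer carries a bias term (an assumption for this illustration; the text does not specify it), the parameter count for $N=6$ is
\[
(4608\cdot512+512) + (512\cdot128+128) + (128\cdot2+2) = 2{,}359{,}808 + 65{,}664 + 258 = 2{,}425{,}730,
\]
so nearly all of the MLP's parameters sit in the first layer, which mixes spatialized features across the pyramid levels.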

To refine our current landmark location estimate $\mathbf{\hat{x}}_t$, we add our new offset estimate $\mathbf{\bar{x}}_t$: $\mathbf{\hat{x}}_{t+1} = \mathbf{\hat{x}}_{t}+\mathbf{\bar{x}}_{t}$. The whole process just described is then repeated with the new location estimate $\mathbf{\hat{x}}_{t+1}$. With each iteration, our method should get closer to the target, meaning that higher resolution patches from the glimpse should contain the target landmark, until finally the landmark is visible in the highest resolution patch and a very accurate estimation can be made (see Algorithm~\ref{alg:net}).


\section{Training}

We trained the network using the Adam optimizer with learning rate 1e-4 for 20 epochs and then learning rate 1e-5 for a further 20 epochs, for a total of 40 epochs. We used a batch size of 2 images. As well as the random initial estimate mentioned earlier, we built some data augmentation into the glimpse sampling process: during training, a glimpse was randomly rotated ($\pm 15^\circ$) and scaled ($\pm 5\%$), and then the inverse was applied to the offset estimate $\mathbf{\bar{x}}$. Note that because of the iterative error-feedback method we use, each step in the epoch actually corresponds to 10 weight updates (one per iteration). The iterations are independent; there is no backpropagation through time (as in a recurrent neural network). We use an $\ell_1$ loss (as in \cite{sunIntegralHumanPose2018}), as we found it worked well. This is somewhat intuitive, as the $\ell_1$ loss is `scale-free'; the size of the gradient step is only affected by the direction to the target (not the distance), meaning that features across all scales should be learned at roughly the same rate. We trained one network for each landmark, for a total of 19 networks per run.
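
The `scale-free' property can be seen by comparing gradients with respect to the estimate. Writing $\mathbf{x}=(x_1,x_2)$ for the label and $\mathbf{\hat{x}}=(\hat{x}_1,\hat{x}_2)$ for the current estimate (component indices introduced here only for illustration), wherever $x_j \neq \hat{x}_j$ we have
\[
\frac{\partial}{\partial \hat{x}_j}\,\|\mathbf{x}-\mathbf{\hat{x}}\|_1 = -\operatorname{sign}(x_j-\hat{x}_j),
\qquad
\frac{\partial}{\partial \hat{x}_j}\,\|\mathbf{x}-\mathbf{\hat{x}}\|_2^2 = -2\,(x_j-\hat{x}_j).
\]
The $\ell_1$ gradient has the same magnitude whether the estimate is one pixel or several hundred pixels from the target, whereas a squared $\ell_2$ loss would weight distant (coarse-scale) errors far more heavily than nearby (fine-scale) ones.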

\begin{algorithm2e}
\caption{Procedure for a Single Image}
\label{alg:net}
 % older versions of algorithm2e have \dontprintsemicolon instead
 % of the following:
 \DontPrintSemicolon
 % older versions of algorithm2e have \linesnumbered instead of the
 % following:
 %\LinesNumbered
\KwIn{Image $\mathbf{X}$}
\KwOut{Estimated Landmark Location $\mathbf{\hat{x}}$}
\BlankLine
$\vec{\mu} = $ mean landmark position and $\vec{\sigma} = $ standard deviation from the training set\;
\uIf{training}{
Initialize location estimate: $\mathbf{\hat{x}} \sim \mathcal{N}(\vec{\mu},\vec{\sigma})$\;
}\Else{
Initialize location estimate: $\mathbf{\hat{x}} \leftarrow \vec{\mu}$\;
}
Initialize Gaussian Pyramid $\mathbf{I}$ with $N$ levels from image $\mathbf{X}$\;
\For{$t\leftarrow1$ \KwTo $10$}{
Initialize an empty vector of spatialized features $\mathfrak{s}$\;
\For{$i\leftarrow1$ \KwTo $N$}{
    Crop a zero padded 64$\times$64 glimpse patch $g_i$ from pyramid level $I_i$ centered on $\mathbf{\hat{x}}$\;
    Process $g_i$ with the CNN to produce a $C \times H \times W$ activation volume $\mathbf{A}$\;
    Spatialize the channels of $\mathbf{A}$ into a flat $3 \times C$ vector $\vec{s}_i$ of $C$ spatialized features\;
    Append $\vec{s}_i$ to $\mathfrak{s}$\;
    
  }
  
  Process $\mathfrak{s}$ with the MLP to produce an offset estimate $\mathbf{\bar{x}}$\;
  
  Update the current location estimate: $\mathbf{\hat{x}} \leftarrow \mathbf{\hat{x}} + \mathbf{\bar{x}}$\;
  \If{training}{
  Backpropagate the $\ell_1$ error between the label $\mathbf{x}$ and the current estimate:  $\|\mathbf{x}-\mathbf{\hat{x}}\|_1$\;
  }
  
}
\end{algorithm2e}

\section{Results}

We ran two experiments. In the first we followed the original protocol of the challenge  \cite{wangBenchmarkComparisonDental2016}, splitting the 400 x-rays into a training set of 150, a Test 1 dataset of 150, and a Test 2 dataset of 100, and using the average of the labels as ground truth (the dataset was labelled by two doctors). In the second we ran four-fold cross validation on all 400 images using the junior doctor's labels as the ground truth, as in \cite{lindnerFullyAutomaticSystem2016a}. 

We show results averaged across all landmarks for Mean Radial Error $\pm$ Standard Deviation (MRE) in millimeters and several successful detection rate (SDR) thresholds. MRE is the Euclidean distance from predictions to ground truth averaged across all landmarks in all images. SDR is the percentage of all predicted landmarks below a given threshold distance from ground truth. There are 10 pixels per millimeter. We also report the inter-observer variability. The inter-observer variability for a given landmark is the average of the distance from each of the two labels to their mean (the ground truth).
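
These metrics can be stated compactly. Let $j=1,\ldots,n$ index all (image, landmark) pairs, and let $r_j = \|\mathbf{\hat{x}}_j-\mathbf{x}_j\|_2/10$ be the radial error in millimeters (pixel distance divided by 10 pixels per millimeter). Reading `below' as $\leq$ for the boundary case, which is an assumption of this sketch, we have
\[
\mathrm{MRE} = \frac{1}{n}\sum_{j=1}^{n} r_j,
\qquad
\mathrm{SDR}_{\tau} = \frac{100}{n}\,\bigl|\{\,j : r_j \leq \tau\,\}\bigr|,
\quad \tau \in \{2.0,\,2.5,\,3.0,\,4.0\}\ \mathrm{mm}.
\]
For the inter-observer variability, if $\mathbf{a}$ and $\mathbf{b}$ are the two doctors' labels and $\mathbf{m}=(\mathbf{a}+\mathbf{b})/2$ is their mean, then $\|\mathbf{a}-\mathbf{m}\|_2=\|\mathbf{b}-\mathbf{m}\|_2=\|\mathbf{a}-\mathbf{b}\|_2/2$, so the per-point variability is simply half the distance between the two labels.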

All of our reported results are state-of-the-art in their respective categories. Additionally, we are within the inter-observer variability for four-fold cross validation and Test 1. See \appendixref{appendix:res} for our results reported by landmark.

\setlength{\tabcolsep}{4pt}

\begin{table}[t]
\floatconts
  {tab2}
  {\caption{Comparison with other methods (average results over all landmarks). MRE is Mean Radial Error $\pm$ standard deviation. SDR is Successful Detection Rate, i.e. what percentage of test points were within a given radial threshold of the ground truth. The results are vertically separated into their respective training/test sets.}}{
\begin{tabular}{|l|l|c|c|c|c|c|}
\hline
\multicolumn{3}{|c|}{} & \multicolumn{4}{|c|}{\textbf{SDR \%}}\\
\hline
\textbf{Data} & \textbf{Method} & \textbf{MRE (mm)} & \textbf{2.0mm} & \textbf{2.5mm} & \textbf{3.0mm} & \textbf{4.0mm}\\
\hline

4-fold &
Inter-Observer Variability & 1.07 $\pm$ 0.80 & 85.00 & 90.14 & 93.59 & 97.07 \\\cline{2-7}
& Lindner \etal (2016) & 1.20 $\pm$ 0.60\footnote{Reconstructed from originally reported standard error to be consistent with others' reporting} & 84.70 & 89.38 & 92.62 & 96.30\\
& Zhong \etal (2019) & 1.22 $\pm$ 2.45 & 86.06 & 90.84 & 94.04 & 97.28 \\
& Ours & \textbf{1.07} $\pm$ 0.95 & \textbf{86.72} & \textbf{92.03} & \textbf{94.93} & \textbf{97.82}\\
\hline
Test 1 &
Inter-Observer Variability & 1.18 $\pm$ 0.78 & 81.44 & 88.28 & 93.09 & 97.58\\\cline{2-7}
& Lindner \& Cootes (2015) & 1.67 $\pm$ 1.65 & 74.95 & 80.28 & 84.56 & 89.68\\
& Ibragimov \etal (2015) & 1.84 $\pm$ 1.76 & 71.72 & 77.40 & 81.93 & 88.04\\
& Arik \etal (2017) & & 75.37 & 80.91 & 84.32 & 88.25 \\
& Qian \etal (2019) & & 82.50 & 86.20 & 89.30 & 90.60 \\
& Zhong \etal (2019) & 1.12 $\pm$ 1.03 & 86.91 & 91.82 & 94.88 & 97.90 \\
& Ours & \textbf{1.01} $\pm$ 0.85 & \textbf{88.32} & \textbf{93.12} & \textbf{96.14} & \textbf{98.63}\\
\hline
Test 2 & Inter-Observer Variability & 0.76 $\pm$ 0.55 & 94.74 & 97.37 & 98.32 & 99.32 \\\cline{2-7}
& Lindner \& Cootes & & 66.11 & 72.00 & 77.63 & 87.43\\
& Ibragimov \etal & & 62.74 & 70.47 & 76.53 & 85.11\\
& Arik \etal & & 67.68 & 74.16 & 79.11 & 84.63 \\
& Qian \etal (2019) & & 72.40 & 76.15 & 79.65 & 85.90 \\
& Zhong \etal (2019) & 1.42 $\pm$ 0.84 & 76.00 & 82.90 & 88.74  & 94.32 \\
& Ours & \textbf{1.33} $\pm$ 0.74 & \textbf{77.05} & \textbf{83.16} & \textbf{88.84} & \textbf{94.89}\\
\hline
\end{tabular}
}
\end{table}

\section{Discussion}

Our multiresolution approach to learning features across all scales with the same pretrained CNN seems to make good use of transfer learning. This makes sense, as CNNs are typically trained to be somewhat scale invariant, because the same objects may be seen at many scales due to perspective. We can use features previously learned by the CNN across all scales, despite the images in the x-ray dataset all being at the same scale. This also seems to help with overfitting. Glimpses (as we use them) are a heavily augmented representation of the data: we explode the training set into many crops at many resolutions.

It is interesting to note that though we found a fairly high number of iterations (10) was required during training to get the best results (regardless of number of epochs), inference worked well with surprisingly few iterations, converging to a state-of-the-art estimate in only 3 iterations. On Test 1, reducing the inference iteration count from 10 to 5 only increased the MRE by a negligible 0.002 mm, and reducing it to 3 only increased the MRE by 0.015 mm. This suggests the high iteration count during training is effective because it biases the training process toward sampling the region near the landmark (rather than because the task inherently requires many iterations). Part of the insight behind this approach is that our iterative method will end up sampling image locations that are mistakenly identified as correct by previous iterations, meaning that it specifically learns to correct mistakes it is likely to have made.

The success of this approach is very promising for large images. If image pyramids were precomputed and tiled into a database, it seems possible that storage space could be the bottleneck, rather than memory or compute usage, as each iteration would only load a small glimpse of the image, whose size grows in proportion to the log of the image's side length.
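
A back-of-the-envelope count illustrates this. Counting input pixels only, and assuming the pyramid is already built with a number of levels $N(s)\approx 1+\log_2(s/64)$ for an image of side length $s$, the pixels read per iteration are
\[
P(s) = 64^2 \cdot N(s) \approx 4096\,\bigl(1+\log_2(s/64)\bigr).
\]
For the x-ray images ($N=6$), this is $6\cdot4096 = 24{,}576$ pixels per iteration, compared with $2400\cdot1935 = 4{,}644{,}000$ pixels in the full image, or roughly $0.5\%$. Doubling the side length adds one pyramid level, i.e. $4096$ more pixels per iteration, while the full image grows by a factor of four. This estimate ignores the one-time cost of constructing the pyramid, which still touches every pixel.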

Code is available at \url{https://github.com/logangilmour/FoveatedPyramid}.

% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{
This research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca). We thank Kirby Banman and the MIDL anonymous reviewers for helpful feedback that significantly improved the manuscript.
}


\bibliography{gilmour20}
\break
\appendix
\section{Extended Results}
\label{appendix:res}

\begin{table}[ht]
\floatconts
  {table:full1}
  {\caption{Test 1 results by landmark. MRE is Mean Radial Error $\pm$ Standard Deviation. IOV is the mean radial error as applied to the inter-observer variability. SDR is Successful Detection Rate.}}{
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
& & & \multicolumn{4}{|c|}{SDR \%}\\
\hline
Landmark & MRE (mm) & IOV (mm) & 2.0mm & 2.5mm & 3.0mm & 4.0mm\\

\hline
Sella (L1) & 0.62 $\pm$ 2.11 & \textbf{0.51} $\pm$ 0.91 & 98.67 & 98.67 & 98.67 & 98.67\\
Nasion (L2) & 1.06 $\pm$ 1.01 & \textbf{0.97} $\pm$ 1.18 & 87.33 & 92.00 & 93.33 & 97.33\\
Orbitale (L3) & \textbf{1.15} $\pm$ 0.81 & 1.62 $\pm$ 0.91 & 86.67 & 92.00 & 96.67 & 99.33\\
Porion (L4) & \textbf{1.69} $\pm$ 1.13 & 1.69 $\pm$ 0.94 & 67.33 & 75.33 & 84.67 & 96.67\\
Subspinale (L5) & \textbf{1.55} $\pm$ 1.04 & 1.68 $\pm$ 1.03 & 73.33 & 85.33 & 92.67 & 98.00\\
Supramentale (L6) & \textbf{0.97} $\pm$ 0.68 & 1.60 $\pm$ 1.11 & 91.33 & 95.33 & 99.33 & 100.00\\
Pogonion (L7) & 0.84 $\pm$ 0.69 & \textbf{0.79} $\pm$ 0.51 & 93.33 & 96.00 & 98.67 & 100.00\\
Menton (L8) & 0.80 $\pm$ 0.66 & \textbf{0.69} $\pm$ 0.48 & 94.00 & 98.67 & 99.33 & 99.33\\
Gnathion (L9) & 0.79 $\pm$ 0.64 & \textbf{0.62} $\pm$ 0.41 & 93.33 & 98.00 & 98.67 & 99.33\\
Gonion (L10) & 1.70 $\pm$ 1.04 & \textbf{1.19} $\pm$ 0.89 & 68.67 & 78.67 & 88.67 & 94.67\\
Incision inferius (L11) & 0.50 $\pm$ 0.64 & \textbf{0.35} $\pm$ 0.39 & 95.33 & 97.33 & 98.00 & 99.33\\
Incision superius (L12) & 0.39 $\pm$ 0.46 & \textbf{0.26} $\pm$ 0.43 & 95.33 & 99.33 & 100.00 & 100.00\\
Upper lip (L13) & \textbf{1.26} $\pm$ 0.54 & 1.89 $\pm$ 0.64 & 90.67 & 97.33 & 99.33 & 100.00\\
Lower lip (L14) & \textbf{0.81} $\pm$ 0.37 & 1.53 $\pm$ 0.57 & 99.33 & 99.33 & 100.00 & 100.00\\
Subnasale (L15) & 0.73 $\pm$ 0.62 & \textbf{0.72} $\pm$ 0.42 & 96.00 & 96.67 & 99.33 & 100.00\\
Soft tissue pogonion (L16) & \textbf{1.04} $\pm$ 0.82 & 3.25 $\pm$ 1.17 & 90.67 & 96.67 & 96.67 & 99.33\\
Posterior nasal spine (L17) & \textbf{0.77} $\pm$ 0.63 & 0.84 $\pm$ 0.73 & 95.33 & 97.33 & 98.67 & 99.33\\
Anterior nasal spine (L18) & 1.06 $\pm$ 1.10 & \textbf{0.98} $\pm$ 0.84 & 88.00 & 92.67 & 94.67 & 98.00\\
Articulare (L19) & 1.41 $\pm$ 1.17 & \textbf{1.26} $\pm$ 1.34 & 73.33 & 82.67 & 89.33 & 94.67\\
\hline
Average & \textbf{1.01} $\pm$ 0.85 & 1.18 $\pm$ 0.78 & 88.32 & 93.12 & 96.14 & 98.63\\

\hline
\end{tabular}
}
\end{table}

\begin{table}[ht]
\floatconts
  {table:full2}
  {\caption{Test 2 results by landmark. MRE is Mean Radial Error $\pm$ Standard Deviation. IOV is the mean radial error as applied to the inter-observer variability. SDR is Successful Detection Rate.}}{
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
& & & \multicolumn{4}{|c|}{SDR \%}\\
\hline
Landmark & MRE (mm) & IOV (mm) & 2.0mm & 2.5mm & 3.0mm & 4.0mm\\

\hline
Sella (L1) & \textbf{0.43} $\pm$ 0.35 & 0.44 $\pm$ 0.21 & 99.00 & 99.00 & 100.00 & 100.00\\
Nasion (L2) & 0.80 $\pm$ 0.89 & \textbf{0.62} $\pm$ 0.75 & 90.00 & 96.00 & 97.00 & 98.00\\
Orbitale (L3) & 2.24 $\pm$ 0.93 & \textbf{1.28} $\pm$ 0.83 & 43.00 & 59.00 & 81.00 & 96.00\\
Porion (L4) & 1.57 $\pm$ 1.66 & \textbf{1.26} $\pm$ 1.38 & 77.00 & 83.00 & 87.00 & 93.00\\
Subspinale (L5) & 1.12 $\pm$ 0.69 & \textbf{0.65} $\pm$ 0.47 & 89.00 & 96.00 & 98.00 & 100.00\\
Supramentale (L6) & 2.71 $\pm$ 1.21 & \textbf{1.33} $\pm$ 0.69 & 32.00 & 44.00 & 58.00 & 84.00\\
Pogonion (L7) & \textbf{0.54} $\pm$ 0.48 & 0.60 $\pm$ 0.39 & 99.00 & 99.00 & 99.00 & 100.00\\
Menton (L8) & \textbf{0.53} $\pm$ 0.41 & 0.69 $\pm$ 0.45 & 99.00 & 100.00 & 100.00 & 100.00\\
Gnathion (L9) & \textbf{0.47} $\pm$ 0.31 & 0.47 $\pm$ 0.30 & 100.00 & 100.00 & 100.00 & 100.00\\
Gonion (L10) & 1.23 $\pm$ 0.79 & \textbf{1.06} $\pm$ 0.77 & 86.00 & 93.00 & 97.00 & 99.00\\
Incision inferius (L11) & 0.49 $\pm$ 0.50 & \textbf{0.29} $\pm$ 0.28 & 98.00 & 99.00 & 100.00 & 100.00\\
Incision superius (L12) & 0.35 $\pm$ 0.56 & \textbf{0.23} $\pm$ 0.19 & 98.00 & 98.00 & 98.00 & 99.00\\
Upper lip (L13) & 2.65 $\pm$ 0.56 & \textbf{0.79} $\pm$ 0.34 & 14.00 & 37.00 & 74.00 & 100.00\\
Lower lip (L14) & 1.83 $\pm$ 0.63 & \textbf{0.74} $\pm$ 0.41 & 67.00 & 85.00 & 94.00 & 100.00\\
Subnasale (L15) & 0.78 $\pm$ 0.59 & \textbf{0.72} $\pm$ 0.47 & 95.00 & 99.00 & 99.00 & 100.00\\
Soft tissue pogonion (L16) & 4.40 $\pm$ 1.30 & \textbf{1.37} $\pm$ 0.87 & 3.00 & 5.00 & 13.00 & 35.00\\
Posterior nasal spine (L17) & 0.96 $\pm$ 0.61 & \textbf{0.57} $\pm$ 0.42 & 94.00 & 98.00 & 99.00 & 100.00\\
Anterior nasal spine (L18) & 1.03 $\pm$ 0.70 & \textbf{0.71} $\pm$ 0.64 & 93.00 & 96.00 & 96.00 & 100.00\\
Articulare (L19) & 1.11 $\pm$ 0.81 & \textbf{0.56} $\pm$ 0.57 & 88.00 & 94.00 & 98.00 & 99.00\\
\hline
Average & 1.33 $\pm$ 0.74 & \textbf{0.76} $\pm$ 0.55 & 77.05 & 83.16 & 88.84 & 94.89\\
\hline
\end{tabular}
}
\end{table}

\begin{table}[ht]
\floatconts
  {table:full4fold}
  {\caption{Results of 4-fold cross-validation by landmark. All results are averaged across the 4 runs. MRE is the Mean Radial Error $\pm$ standard deviation. IOV is the inter-observer variability, measured as the mean radial error between observers. SDR is the Successful Detection Rate at each distance threshold.}}{
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
& & & \multicolumn{4}{|c|}{SDR \%}\\
\hline
Landmark & MRE (mm) & IOV (mm) & 2.0\,mm & 2.5\,mm & 3.0\,mm & 4.0\,mm\\
\hline
Sella (L1) & 0.59 $\pm$ 0.78 & \textbf{0.46} $\pm$ 0.59 & 99.00 & 99.25 & 99.50 & 99.50\\
Nasion (L2) & 0.98 $\pm$ 1.14 & \textbf{0.76} $\pm$ 0.98 & 87.00 & 90.25 & 92.75 & 97.00\\
Orbitale (L3) & \textbf{1.21} $\pm$ 1.18 & 1.54 $\pm$ 0.94 & 80.75 & 86.75 & 90.25 & 95.75\\
Porion (L4) & \textbf{1.61} $\pm$ 1.79 & 1.66 $\pm$ 1.14 & 77.25 & 84.00 & 88.25 & 91.00\\
Subspinale (L5) & 1.52 $\pm$ 1.14 & \textbf{1.45} $\pm$ 1.15 & 75.25 & 83.50 & 88.25 & 96.25\\
Supramentale (L6) & \textbf{1.16} $\pm$ 0.78 & 1.51 $\pm$ 0.98 & 84.25 & 93.50 & 97.00 & 99.50\\
Pogonion (L7) & 0.98 $\pm$ 0.70 & \textbf{0.62} $\pm$ 0.45 & 89.75 & 95.25 & 98.75 & 100.00\\
Menton (L8) & 0.80 $\pm$ 0.64 & \textbf{0.66} $\pm$ 0.48 & 95.25 & 96.25 & 98.25 & 99.75\\
Gnathion (L9) & 0.81 $\pm$ 0.68 & \textbf{0.50} $\pm$ 0.36 & 95.75 & 98.50 & 98.75 & 99.25\\
Gonion (L10) & 1.51 $\pm$ 1.12 & \textbf{1.43} $\pm$ 1.03 & 72.75 & 83.00 & 90.50 & 96.75\\
Incision inferius (L11) & 0.53 $\pm$ 0.63 & \textbf{0.33} $\pm$ 0.36 & 96.25 & 97.25 & 98.25 & 99.25\\
Incision superius (L12) & 0.48 $\pm$ 0.80 & \textbf{0.24} $\pm$ 0.34 & 95.25 & 96.25 & 97.75 & 99.50\\
Upper lip (L13) & 1.50 $\pm$ 0.74 & \textbf{1.36} $\pm$ 0.74 & 73.00 & 88.50 & 96.25 & 100.00\\
Lower lip (L14) & 1.12 $\pm$ 0.66 & \textbf{1.09} $\pm$ 0.65 & 89.25 & 95.25 & 98.75 & 99.75\\
Subnasale (L15) & 1.07 $\pm$ 0.86 & \textbf{0.81} $\pm$ 0.56 & 90.00 & 94.75 & 96.00 & 98.25\\
Soft tissue pogonion (L16) & \textbf{1.25} $\pm$ 1.15 & 3.29 $\pm$ 1.78 & 84.00 & 90.00 & 91.75 & 96.75\\
Posterior nasal spine (L17) & 0.91 $\pm$ 0.82 & \textbf{0.72} $\pm$ 0.59 & 93.50 & 96.50 & 97.50 & 98.00\\
Anterior nasal spine (L18) & 1.31 $\pm$ 1.19 & \textbf{0.91} $\pm$ 0.82 & 78.00 & 85.25 & 89.00 & 94.50\\
Articulare (L19) & \textbf{0.98} $\pm$ 1.26 & 1.06 $\pm$ 1.25 & 91.50 & 94.50 & 96.25 & 97.75\\
\hline
Average & \textbf{1.07} $\pm$ 0.95 & 1.07 $\pm$ 0.80 & 86.72 & 92.03 & 94.93 & 97.82\\
\hline
\end{tabular}
}
\end{table}
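
\noindent For reference, the per-landmark metrics reported in the tables above can be written out as follows. This is a sketch under assumed notation that does not appear in the tables themselves: $N$ is the number of evaluated images, $\hat{\mathbf{p}}_i$ and $\mathbf{p}_i$ are the predicted and reference positions of a given landmark in image $i$ (in mm), and $z$ is the detection threshold (2.0, 2.5, 3.0 or 4.0 mm). Whether the threshold comparison is inclusive or strict is also an assumption here.
\begin{align*}
r_i &= \lVert \hat{\mathbf{p}}_i - \mathbf{p}_i \rVert_2, \\
\mathrm{MRE} &= \frac{1}{N}\sum_{i=1}^{N} r_i, \\
\mathrm{SDR}_z &= \frac{100\%}{N}\,\bigl|\{\, i : r_i \le z \,\}\bigr|.
\end{align*}
Under this reading, the IOV column would be the same mean radial error computed between the observers' annotations rather than between a prediction and the reference.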



\end{document}
