\begin{figure}[t]
\floatconts
  {fig:mrpyrnet}
  {\caption{Processing procedure of \myalgoname. Each MRI slice $\slice_i$ passes through a backbone CNN (represented by the gray shapes). The feature maps produced are exploited in an FPN strategy (represented by the green shapes), and in turn analyzed by the \modulename\ modules (light red shapes). These build the vectors $\{ \pooledvec_{i,l} \}_{l=1}^{L}$ which consist of representations containing MRI information at multiple levels of detail.}}
  {\includegraphics[width=.92\linewidth]{images/pipeline.pdf}}
\end{figure}


\section{Methodology}
In this section, we describe the processing procedure of \myalgoname\ and its components, which are shown as a whole in Figure \ref{fig:mrpyrnet}.
We assume that an MRI exam, consisting of a sequence of \rev{$S$ slices of size $\width \times \height \times C$, where $C$ is the number of image channels used to represent a slice,} is given as input to a standard CNN-based pipeline \citep{MRNet,ELNet}.

\paragraph{Backbone.} Each slice $\slice_i$, $i \in \{0, \cdots, S-1\}$, of an MRI exam is first processed by a backbone CNN (e.g., MRNet's AlexNet) that produces feature maps carrying semantically different information. This is achieved by taking the outputs of the backbone CNN's intermediate layers. More formally, we denote by $\featuremaps_{i,l} \in \reals^{\width_l \times \height_l \times C_l}$ the output of the $l$-th layer of the backbone CNN, computed by the function
$\featfun_{\level}(\slice_i)$. In our setting, we consider $l \in \{1, \cdots, \levels\}$ with $\levels=5$, thereby exploiting 5 semantically different CNN outputs.
The execution of the backbone thus generates the set of feature maps $\{ \featuremaps_{i,1}, \featuremaps_{i,2}, \featuremaps_{i,3}, \featuremaps_{i,4}, \featuremaps_{i,5}\}$.

\paragraph{Feature Pyramid Network.} Since knee anomalies have been shown to occupy small regions of the MRI slices \citep{Hash2013,Nguyen2014,Lecouvet2018}, we exploit an FPN architecture \citep{FPN}, which has proven particularly effective for the detection of small objects.
Following \citet{FPN}, we exploit the outputs $\{\featuremaps_{i,l}\}_{l=1}^{L}$ of the backbone by combining the semantically-stronger features of higher layers with the more accurately localized features of lower layers. 
In particular, at each level $\level$, the higher-level feature maps $\fpnfeaturemaps_{i,l+1}$
are up-sampled by bilinear interpolation and transformed by a convolutional layer with ReLU activation to match the spatial and channel dimensions of $\featuremaps_{i,l}$. These features are then added element-wise to $\featuremaps_{i,l}$, which is first transformed by a $1\times1$ convolution with ReLU activation (the so-called lateral connection). Together, the two steps generate the FPN features $\fpnfeaturemaps_{i,l}$.
The whole procedure is executed from the highest layer to the lowest (top-down pathway), resulting in the set of feature maps $\{\fpnfeaturemaps_{i,l} \}_{l=1}^{L}$. At the highest level (i.e., $\level=5$), we set $\fpnfeaturemaps_{i,5} = \featuremaps_{i,5}$.
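As a compact summary, the top-down pathway described above can be sketched as the following recursion. The operators $\mathrm{Up}(\cdot)$ (bilinear up-sampling), $\mathrm{Conv}(\cdot)$ and $\mathrm{Conv}_{1\times1}(\cdot)$ are shorthand introduced here only for illustration, and each denotes a distinct learned layer at every level:
\begin{align*}
   \fpnfeaturemaps_{i,\levels} &= \featuremaps_{i,\levels}, \\
   \fpnfeaturemaps_{i,l} &= \mathrm{ReLU}\big(\mathrm{Conv}_{1\times1}(\featuremaps_{i,l})\big) + \mathrm{ReLU}\big(\mathrm{Conv}(\mathrm{Up}(\fpnfeaturemaps_{i,l+1}))\big), \quad l = \levels-1, \cdots, 1.
\end{align*}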

\begin{figure}[t]
\floatconts
  {fig:pyrmodule}
  {\caption{Visual representation of the operation performed by the \modulename\ module with $\pyramidlevels = 3$. \modulename\ takes as input the FPN feature maps $\fpnfeaturemaps_{i,l}$ and a set of $\pyramidlevels$ sub-regions, located at the center of $\fpnfeaturemaps_{i,l}$, with sizes $(\width_{\level,\pyramidlevel}, \height_{\level,\pyramidlevel})$ that are derived from the dimensions $(\width_{0,\pyramidlevel}, \height_{0,\pyramidlevel})$ defined at the slice level. For each detail level $\pyramidlevel$, $\fpnfeaturemaps_{i,l}$ is pooled channel-wise by means of the function $\poolfun(\cdot)$ in the area defined by the size $(\width_{\level,\pyramidlevel}, \height_{\level,\pyramidlevel})$. This operation generates $\pyramidlevels$ vectors that are then concatenated into $\pooledvec_{i,l}$, which is the output of the proposed module.}}
  {\includegraphics[width=.92\linewidth]{images/pdp.pdf}}
\end{figure}

\paragraph{Pyramidal Detail Pooling.}
The FPN strategy enhances the quality of features for tasks that require precise spatial information, giving the freedom of designing arbitrary methods to exploit such information.
We introduce \modulename, a pyramidal feature pooling module capable of capturing the relevant information of the knee disorder \citep{Hash2013,Nguyen2014,Lecouvet2018}. 
\modulename\ analyzes the FPN representations at multiple levels of detail by focusing on increasingly smaller sub-regions localized in the slice center.
The module takes as input a feature map tensor and a series of sub-region dimensions, and produces a vectorized representation combining the information contained in each sub-region.
In more detail, \modulename\ (depicted in Figure \ref{fig:pyrmodule}) is fed with $\fpnfeaturemaps_{i,l}$ and with a list of $\pyramidlevels \in \integers$ sub-regions of size $(\width_{\level,\pyramidlevel}, \height_{\level,\pyramidlevel})$, where $\pyramidlevel \in \{0, \cdots, \pyramidlevels-1\}$.
For each $\pyramidlevel$, the module crops, channel-wise, the sub-tensor of size $(\width_{\level,\pyramidlevel}, \height_{\level,\pyramidlevel})$ centered at the feature-map center, which has coordinates $x = \big\lfloor \frac{\width_l}{2} \big\rceil, y = \big\lfloor \frac{\height_l}{2} \big\rceil$.
This sub-tensor
is processed by a global pooling function $\poolfun(\cdot)$, which reduces each feature map to a single value, thus yielding a vector $\pooledvec_{i,\level,\pyramidlevel}$ of length $\numfeaturemaps_l$. Finally, the $\pyramidlevels$ resulting vectors are concatenated into a single vector representation $\pooledvec_{i,\level}$ of length $\pyramidlevels \cdot \numfeaturemaps_{\level}$, which is the output of the proposed module.
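In compact form, the operation of the module can be sketched as follows, where $\mathrm{crop}_{\pyramidlevel}(\cdot)$ is a shorthand, introduced here only for illustration, for the centered crop of size $(\width_{\level,\pyramidlevel}, \height_{\level,\pyramidlevel})$, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation:
\begin{align*}
   \pooledvec_{i,\level,\pyramidlevel} &= \poolfun\big(\mathrm{crop}_{\pyramidlevel}(\fpnfeaturemaps_{i,\level})\big) \in \reals^{\numfeaturemaps_{\level}}, \quad \pyramidlevel \in \{0, \cdots, \pyramidlevels-1\}, \\
   \pooledvec_{i,\level} &= \big[\pooledvec_{i,\level,0};\, \cdots;\, \pooledvec_{i,\level,\pyramidlevels-1}\big] \in \reals^{\pyramidlevels \cdot \numfeaturemaps_{\level}}.
\end{align*}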
The dimensions $(\width_{\level,\pyramidlevel}, \height_{\level,\pyramidlevel})$ are obtained by mapping the slice-level dimensions $(\width_{0,\pyramidlevel}, \height_{0,\pyramidlevel}), \pyramidlevel \in \{0, \cdots,\pyramidlevels-1\}$ according to the 2D dimension reduction induced by the sub-sampling operations of the backbone layers (i.e., convolutional/pooling operations with their respective kernel/stride/padding sizes). Only sub-regions resulting in $\width_{\level,\pyramidlevel} > 0, \height_{\level,\pyramidlevel} > 0$ are retained. If multiple slice-level sub-regions map to the same representation-level sub-region, only one of them is kept.
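As a reminder of how such a mapping behaves, and assuming the standard output-size rule for a convolutional or pooling layer without dilation, a layer with kernel size $k$, stride $s$ and padding $p$ (symbols introduced here for illustration) maps a spatial extent $n$ to
\begin{align*}
   n' = \Big\lfloor \frac{n + 2p - k}{s} \Big\rfloor + 1 .
\end{align*}
For instance, with the purely illustrative values $n = 224$, $k = 11$, $s = 4$ and $p = 2$, one obtains $n' = \lfloor 217/4 \rfloor + 1 = 55$. Composing this rule over the backbone layers up to level $\level$ gives one possible way to derive $(\width_{\level,\pyramidlevel}, \height_{\level,\pyramidlevel})$ from $(\width_{0,\pyramidlevel}, \height_{0,\pyramidlevel})$.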
Based on experiments, we propose a simple but effective strategy to compute candidate sub-regions likely to contain the knee anomaly.
We generate $(\width_{0,\pyramidlevel}, \height_{0,\pyramidlevel})$ by considering
\begin{align}
   X_{0,\pyramidlevel} = X \cdot \left(1 - \frac{\pyramidlevel}{\pyramidlevels}\right), \quad X \in \{W,H\}, \quad \pyramidlevel \in \{0, \cdots, \pyramidlevels-1\}.
\end{align}
In simple words, at each $\pyramidlevel$ we consider a sub-region which focuses on an increasingly smaller part of the MRI slice, thus increasing the level of detail for that part of the slice.
Defining sub-regions at the slice level simplifies the design of detailing strategies and makes the process independent of the backbone architecture. Moreover, this strategy requires setting a single hyper-parameter, $\pyramidlevels$, which controls both the number of detail levels and their sizes.
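As a worked example, consider $\pyramidlevels = 3$ and a hypothetical slice size $W = H = 240$ (a value chosen here only so that the results are integers). The slice-level sub-region sizes are then
\begin{align*}
   W_{0,0} = 240 \cdot \left(1 - \tfrac{0}{3}\right) = 240, \qquad
   W_{0,1} = 240 \cdot \left(1 - \tfrac{1}{3}\right) = 160, \qquad
   W_{0,2} = 240 \cdot \left(1 - \tfrac{2}{3}\right) = 80,
\end{align*}
and likewise for $H_{0,\pyramidlevel}$. The first sub-region thus covers the whole slice, while the following ones cover progressively smaller centered areas.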
\rev{The overall \modulename\ strategy is designed to be robust.
Indeed, if the features of a knee anomaly fall outside an inner sub-region, they can still be captured by the outer sub-region of a previous detail level. At the lowest level (i.e., $\pyramidlevel = 0$), the sub-region matches the dimensions of the backbone's feature maps, thus guaranteeing that the features are exploited at least as well as by the backbone alone.}



\paragraph{Feature Combination and Output Prediction.} To obtain a single representation for the \rev{whole} MRI exam, we follow a strategy similar to that of \citet{MRNet,ELNet}. \rev{At each backbone level $\level$,} a max-pooling operation is \rev{applied} series-wise \rev{(i.e., across slices)} to $\pooledvec_{i,l}$. \rev{These operations result in} the set of \rev{single} vector representations $\{\pooledvec_l \}_{l=1}^{L}$ that summarize the information contained in the MRI sequence at multiple center-focused levels of detail.
Each of these vectors is finally given to a separate fully connected layer that predicts the probability $\modeloutput_{\level}$ of presence/absence of a knee disorder.
Based on experiments, we found the maximum of these probabilities to be the best estimate of knee anomaly presence.
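The two aggregation steps above can be summarized as follows, where the maximum over slices is taken element-wise and the unsubscripted symbol $\modeloutput$ for the final exam-level estimate is a notation introduced here only for illustration:
\begin{align*}
   \pooledvec_{l} &= \max_{i \in \{0, \cdots, S-1\}} \pooledvec_{i,l}, \quad l \in \{1, \cdots, \levels\}, \\
   \modeloutput &= \max_{l \in \{1, \cdots, \levels\}} \modeloutput_{l}.
\end{align*}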


\paragraph{Model Learning.}
To optimize the whole \myalgoname\ pipeline,
each MRI exam in the training set is treated as a training sample. The probability estimates $\{\modeloutput_l\}_{l=1}^{L}$ of pathology presence/absence are obtained via the procedure described in the previous paragraphs. Each prediction is compared to the ground-truth label $\groundtruth$ via the loss function $\loss(\modeloutput_l, \groundtruth)$, which depends on the considered backbone (details follow). The overall objective is to minimize $\sum_{l=1}^{L} \loss(\modeloutput_l, \groundtruth)$.

