\section{Experiments and Results}\label{sec:experiments-and-results}
To evaluate the framework described in \sectionref{sec:proposed-framework}, each step was implemented in PyTorch and evaluated on the datasets of \sectionref{sec:datasets}. To measure the quality of each step in isolation, we tested the steps individually.
Because of the relatively small datasets, we used EfficientNet-B0 \cite{tanEfficientNetRethinkingModel2019} for classification. For segmentation, a DeepLabV3 \cite{chenRethinkingAtrousConvolution2017} with a ResNet-50 \cite{heDeepResidualLearning2016} backbone was used. Neither network was pretrained.
In all experiments, we padded the input radiograph with zeros to reach the desired size while maintaining the aspect ratio.
Furthermore, the training radiographs were augmented with random cropping, histogram normalization, Gaussian noise, blurring, horizontal flipping and rotation.
The data were split into training and test sets with an 80/20 ratio.
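
The zero-padding step can be illustrated with a minimal sketch. It assumes that the radiograph is padded to a square canvas with the content centred, so that a subsequent isotropic resize to the network input size does not distort the anatomy. The helper \texttt{pad\_to\_square} and the nested-list image representation are hypothetical and chosen only to keep the example dependency-free; an actual PyTorch pipeline would operate on tensors.

```python
from typing import List

# Row-major grayscale image; a hypothetical stand-in for a tensor.
Image = List[List[float]]


def pad_to_square(image: Image, fill: float = 0.0) -> Image:
    """Pad a 2D image with `fill` to a square canvas, keeping the content centred.

    Centring is an assumption of this sketch; the padding could equally be
    applied to one side only.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    side = max(height, width)

    # Offsets of the original content inside the square canvas.
    top = (side - height) // 2
    left = (side - width) // 2

    padded = [[fill] * side for _ in range(side)]
    for r, row in enumerate(image):
        padded[top + r][left:left + width] = row
    return padded
```

Because the canvas is square, the aspect ratio of the radiograph content is unchanged by any later uniform rescaling.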

\subsection{Recognition of the Radiographic View}\label{sec:experiments-and-results:recognition-of-the-radiographic-view}
For the recognition of the radiographic view, the dataset described in \sectionref{sec:datasets:weakly-labeled-dataset} was used. For this purpose, the last layer of the EfficientNet-B0 was modified to output two classes, \textit{LAT} and \textit{AP}, and was followed by a softmax layer to obtain class probabilities. The model was trained with the cross-entropy loss and stochastic gradient descent (SGD) with a learning rate of \(1 \cdot 10^{-3}\), a momentum of 0.9, a weight decay of \(1 \cdot 10^{-5}\), and a batch size of 8 over 500,000 iterations. To reduce possible overfitting, the drop connect \cite{wanRegularizationNeuralNetworks2013} rate was set to 0.4. After augmentation, the input size of the radiographs was \(224 \times 224\) pixels.

Training with these parameters resulted in an accuracy of 98.4\% on the test set and 98.5\% on the training set. These results must be interpreted with some caution because labels in the weakly labeled dataset may be assigned incorrectly.
It may be that
\begin{inparaenum}[(i)]
    \item the model predicts the correct class but the label is assigned incorrectly or that
    \item the model predicts the incorrect class and the label is also assigned incorrectly.
\end{inparaenum}
Reviewing the radiographs of case (i), that is, the apparent misclassifications, revealed 54 wrong labels in the test set and 244 in the training set. Taking these into account, the accuracy increased to 99.5\% on the test set and to 99.7\% on the training set.
Although the actual accuracy may be slightly lower due to undetected errors of case (ii), these results demonstrate that the radiographic view can be recognized with high accuracy.
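As a sketch of this correction, let \(N\) be the number of radiographs in a set, \(n_{\mathrm{c}}\) the number of predictions that match the recorded label, and \(n_{\mathrm{r}}\) the number of apparent misclassifications whose label turned out to be wrong on review. These symbols are introduced here for illustration only, and the sketch assumes that every such relabeled radiograph turns from an error into a correct prediction:
\[
    \mathrm{acc}_{\mathrm{raw}} = \frac{n_{\mathrm{c}}}{N},
    \qquad
    \mathrm{acc}_{\mathrm{corrected}} = \frac{n_{\mathrm{c}} + n_{\mathrm{r}}}{N},
\]
with \(n_{\mathrm{r}} = 54\) for the test set and \(n_{\mathrm{r}} = 244\) for the training set.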


\subsection{Extraction of the ROI}\label{sec:experiments-and-results:extraction-of-the-roi}
To segment the ROI, a DeepLabV3 was trained with the labels described in \sectionref{sec:datasets:diagnostic-quality-dataset}. 
The target feature map is binary, with 0 for \textit{not ROI} and 1 for \textit{ROI}. 
As segmentation output, we used a single feature map followed by a sigmoid function to obtain pixel-wise outputs between 0 and 1.
For training, we used the mean pixel-wise squared error as loss, minimized with the Adam optimizer with a learning rate of \(1 \cdot 10^{-4}\), a weight decay of \(1 \cdot 10^{-4}\), and a batch size of 4 over 50,000 iterations. For this task, the input size after augmentation was \(400 \times 400\) pixels. Training was done separately for the \textit{LAT} and \textit{AP} views. Given the small dataset, we used random sub-sampling validation over 12 different dataset splits.
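Written out for a single radiograph, this loss can be sketched as follows. The notation is introduced here for illustration: \(z_{ij}\) denotes the raw network output at pixel \((i, j)\), \(y_{ij} \in \{0, 1\}\) the target, \(\sigma\) the sigmoid function, and \(H = W = 400\) the input size. Averaging over the batch is assumed in addition.
\[
    \mathcal{L}_{\mathrm{seg}} = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \bigl( \sigma(z_{ij}) - y_{ij} \bigr)^{2},
    \qquad
    \sigma(z) = \frac{1}{1 + e^{-z}}.
\]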

To measure the accuracy of the predicted ROIs, the Dice score was calculated. A pixel was classified as part of the ROI if its value in the output feature map was above a threshold of 0.7. Over all 12 dataset splits, the mean Dice score was 94.17\% on the \textit{AP} views and 85.91\% on the \textit{LAT} views. A reason for the worse result on the \textit{LAT} views might be that the ROIs on the \textit{LAT} views are considerably smaller than on the \textit{AP} views and thus harder to predict. Regardless of this difference in the Dice score, the resulting segmentations are sufficient to obtain bounding boxes of the ROIs. To extract a bounding box from a segmentation, the smallest enclosing rectangle of the segmentation is computed first and then rotated to be horizontal. Examples with the labeled and the predicted ROIs can be seen in \figureref{fig:experiments-and-results:roi-segmentation:example-roi}.
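
The thresholded Dice score can be sketched as follows. The function \texttt{dice\_score} and the nested-list representation of the probability map are hypothetical and serve only as a dependency-free illustration. The convention that two empty masks yield a score of 1 is an assumption of this sketch.

```python
from typing import Sequence


def dice_score(probabilities: Sequence[Sequence[float]],
               target: Sequence[Sequence[int]],
               threshold: float = 0.7) -> float:
    """Dice score between a thresholded probability map and a binary mask."""
    intersection = 0
    predicted_pixels = 0
    target_pixels = 0
    for prob_row, target_row in zip(probabilities, target):
        for prob, label in zip(prob_row, target_row):
            # A pixel belongs to the predicted ROI if it exceeds the threshold.
            pred = 1 if prob > threshold else 0
            intersection += pred * label
            predicted_pixels += pred
            target_pixels += label
    denominator = predicted_pixels + target_pixels
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator
```

The score is twice the overlap divided by the total number of ROI pixels in both masks, so a missed pixel weighs more heavily when the ROI is small.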

\begin{figure}[htbp]
    % Caption and label go in the first arguments and the figure contents
    % go in the last argument
    \floatconts{fig:experiments-and-results:roi-segmentation:example-roi}{\caption{In \protect\subfigref{fig:experiments-and-results:roi-segmentation:example-roi:ap} two radiographs in the \textit{AP} view are shown. Their labeled ROI is marked with a blue box and the predicted ROI with a red box. The predicted segmentation mask used to construct the red box is highlighted. The same is shown in \protect\subfigref{fig:experiments-and-results:roi-segmentation:example-roi:lat} for the \textit{LAT} view. Both examples also show that the proportion of ROI in the radiograph can vary greatly.}}
    {
        \subfigure[]{\label{fig:experiments-and-results:roi-segmentation:example-roi:ap}\includegraphics[height=4.2cm]{images/example-roi-ap.png}}
        \subfigure[]{\label{fig:experiments-and-results:roi-segmentation:example-roi:lat}\includegraphics[height=4.2cm]{images/example-roi-lat.png}}
    }
\end{figure}


\subsection{Quality Assessment}\label{sec:experiments-and-results:quality-prediction}
For the quality assessment task an EfficientNet-B0 was used.
To preserve the intrinsic order of the classes, we modeled the task as a regression problem. One benefit of regression is that we obtain intermediate scores.
We also trained classification networks using the earth mover's distance, but this led to slightly worse results.
The model was trained with the mean squared error (MSE) as loss and the mean label of the four radiologists as target. The loss was minimized by SGD with a learning rate of \(1 \cdot 10^{-3}\), a momentum of 0.9, a weight decay of \(1 \cdot 10^{-3}\), and a batch size of 16 over 500,000 iterations.
As in \sectionref{sec:experiments-and-results:recognition-of-the-radiographic-view}, the input size was \(224 \times 224\) pixels. The same random sub-sampling validation as in \sectionref{sec:experiments-and-results:extraction-of-the-roi} was used for testing.

To evaluate the accuracy of the model, an output was counted as correct if the class nearest to the continuous output was the class of the label. Evaluation on the test set resulted in a mean accuracy of 93.0\% for the \textit{AP} view and 95.1\% for the \textit{LAT} view, with a mean absolute error of 0.19 for \textit{AP} and 0.20 for \textit{LAT}.
Over the 12 runs, the standard deviation of the accuracy is 0.025 and 0.02, and the median accuracy is 93.4\% and 95.4\% for \textit{AP} and \textit{LAT}, respectively.
The classification into \textit{diagnostic} and \textit{non-diagnostic} (see \sectionref{sec:datasets:diagnostic-quality-dataset}) resulted in an accuracy of 97.8\% for the \textit{AP} view and 93.2\% for the \textit{LAT} view.
This shift in accuracy occurs because the labels \textit{1} and \textit{3} are distributed differently in the \textit{AP} and \textit{LAT} parts of the dataset.
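
The nearest-class evaluation of the regression output can be sketched as follows. The helper names, the example class set \texttt{(1, 2, 3)}, and the rule that ties go to the lower class are assumptions made for illustration. The sketch further assumes that the class of the label is obtained by rounding the mean expert label in the same way.

```python
from typing import Sequence, Tuple


def nearest_class(score: float, classes: Sequence[int]) -> int:
    """Return the class closest to a continuous score (ties go to the lower class)."""
    return min(sorted(classes), key=lambda c: abs(score - c))


def accuracy_and_mae(outputs: Sequence[float],
                     targets: Sequence[float],
                     classes: Sequence[int]) -> Tuple[float, float]:
    """Nearest-class accuracy and mean absolute error of regression outputs."""
    hits = sum(
        nearest_class(o, classes) == nearest_class(t, classes)
        for o, t in zip(outputs, targets)
    )
    mae = sum(abs(o - t) for o, t in zip(outputs, targets)) / len(outputs)
    return hits / len(outputs), mae
```

For instance, an output of 2.6 against a mean label of 2.0 counts as wrong, because the nearest class to the output is 3, although the absolute error is only 0.6.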

To evaluate whether the accuracy of the quality assessment benefits from the steps described in \sectionref{subsec:proposed-framework:recognition-of-the-radiographic-view,subsec:proposed-framework:extraction-of-the-roi}, we repeated the training with and without these steps. The results, which are given in \tableref{tab:experiments-and-results:quality-prediction:preprocessing-results}, show that each step of the pipeline improves the accuracy. Overall, the mean accuracy improves from 82.4\% to 94.1\% when all steps are included. While the benefit of training separately for the different views is small, the extraction of ROIs seems to be necessary to obtain high accuracy. Without the preceding view recognition, a single model is trained on the combined \textit{AP} and \textit{LAT} data to predict the quality of both views. In this case, each view is sampled equally often.

To estimate how accurate the labels are, we tested each labeling radiologist against the others, taking one radiologist's label as the prediction and the mean of the remaining three as ground truth. If the difference between prediction and ground truth was at least 1, the prediction was counted as wrong. This resulted in a mean accuracy of 92.6\% for \textit{AP} and 90.1\% for \textit{LAT}.
Across the four radiologists, the standard deviation is 0.026 and 0.037 for \textit{AP} and \textit{LAT}, respectively.
The mean accuracy over both views is 94.1\% for the networks and 91.4\% for the radiologists. Although our method for evaluating the performance of the radiologists is based on only four experts, it should suffice as a first estimate.
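
This leave-one-out comparison of the radiologists can be sketched as follows. The function \texttt{rater\_accuracies} and the row-per-radiograph layout of the labels are hypothetical. Taking the absolute value of the difference is an assumption of this sketch.

```python
from statistics import mean
from typing import List, Sequence


def rater_accuracies(labels: Sequence[Sequence[float]]) -> List[float]:
    """Accuracy of each rater against the mean of the remaining raters.

    labels[i][r] is the label that rater r assigned to radiograph i.
    """
    n_raters = len(labels[0])
    accuracies = []
    for r in range(n_raters):
        correct = 0
        for row in labels:
            # Ground truth: mean label of all raters except rater r.
            others = [row[k] for k in range(n_raters) if k != r]
            # Counted as wrong if the (absolute) difference is at least 1.
            if abs(row[r] - mean(others)) < 1:
                correct += 1
        accuracies.append(correct / len(labels))
    return accuracies
```

Averaging the returned values gives a mean rater accuracy per view, and their spread gives the corresponding standard deviation.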

A visual comparison of the expert labels and framework predictions on the unlabeled dataset can be seen in the Appendix in \figureref{fig:appendix:example:ap} for the \textit{AP} view and \figureref{fig:appendix:example:lat} for the \textit{LAT} view.
For further illustration, the ROIs with the highest error between expert label and predicted quality are shown in \figureref{fig:appendix:example:bad-predictions}. Note that there is no clear pattern that explains the deviations.

\begin{table}[htbp]
    \begin{minipage}[t]{0.45\textwidth}
        % Caption and label go in the first arguments and the figure contents
        % go in the last argument
        \floatconts
        {tab:experiments-and-results:quality-prediction:preprocessing-results}
        {\caption{Accuracy of quality assessment depending on the steps \textit{View Recognition} (\sectionref{subsec:proposed-framework:recognition-of-the-radiographic-view}) and \textit{ROI Extraction} (\sectionref{subsec:proposed-framework:extraction-of-the-roi}). Not training separately for \textit{AP} and \textit{LAT} and not extracting the ROI leads to the lowest accuracy. Both steps on their own increased the accuracy, while using both provided the best result.}}
        {%
        \begin{tabular}{@{}cclll@{}}
        \toprule
        \multirow{2}{*}{\makecell{View\\Recog.}} & \multirow{2}{*}{\makecell{ROI\\Ext.}} & \multicolumn{3}{l}{Accuracy} \\ \cmidrule(l){3-5} 
                                                 &                                       & mean     & \textit{AP}      & \textit{LAT}  \\ \midrule
        \xmark                                   & \xmark                                & 82.4\%   & 80.3\%  & 84.5\%  \\
        \cmark                                   & \xmark                                & 85.1\%   & 82.9\%  & 87.2\%  \\
        \xmark                                   & \cmark                                & 92.4\%   & 92.2\%  & 92.5\%  \\
        \cmark                                   & \cmark                                & 94.1\%   & 93.0\%  & 95.1\%  \\ \bottomrule
        \end{tabular}
        }
    \end{minipage}
    \hfill
    \begin{minipage}[t]{0.5\textwidth}
        % Caption and label go in the first arguments and the figure contents
        % go in the last argument
        \floatconts
        {tab:experiments-and-results:tasks-results}
        {\caption{Overview of all steps in the framework and their results. The results for the \textit{View Recognition} and the \textit{Quality Assessment} are the achieved accuracy. For the \textit{ROI Extraction} the result is the achieved Dice score. The \textit{AP} and \textit{LAT} results are not from the same model, because we trained individually for each view. Since this is not the case for the \textit{View Recognition}, there is only a single accuracy.}}
        {%
        \begin{tabular}{@{}llll@{}}
        \toprule
        Step               & \multicolumn{3}{l}{Accuracy or Dice}                                          \\ \cmidrule(l){2-4} 
                           & mean   & \textit{AP}                   & \textit{LAT}                   \\ \midrule
        View Recognition   & 99.5\% & \multicolumn{1}{c}{\textendash} & \multicolumn{1}{c}{\textendash} \\
        ROI Extraction     & 90.1\% & 94.2\%                          & 85.9\%                          \\
        Quality Assessment & 94.1\% & 93.0\%                          & 95.1\%                          \\ \bottomrule
        \end{tabular}
        }
    \end{minipage}
\end{table}
