Quantity or Certainty: Can Ambiguously Annotated Data Improve Lung Nodule Detection Performance?
Abstract:
Purpose:
In deep learning (DL), a large, well-curated dataset is essential. However, when developing DL-based computer-aided detection (CADe) models for pulmonary nodule detection in computed tomography (CT) scans, observer variability complicates dataset curation. For example, under the LUNA16 dataset protocol, a nodule is included only if at least three out of four observers agree on its presence. This study therefore analyzes how ambiguously annotated data affect model performance.
Methods and Materials:
We collected a dataset of 891 nodules identified by a radiologist across 223 cases sourced from the United States. The dataset was divided into seven folds with case-level splits: 161 cases (5 folds) containing 624 nodules formed the training set, and the remaining 62 cases (2 folds) containing 267 nodules formed the test set. The nodules were then re-annotated by a second radiologist. The training set was divided into Group A, consisting of 376 nodules on which both radiologists agreed, and Group B, consisting of 248 nodules marked only by the initial radiologist. Three models were then trained on different groups to analyze the impact of ambiguously annotated data: 1) a model trained with Group A, 2) a model trained with Groups A and B, and 3) a model trained with Group B. All models used the same ResNeXt network architecture and were trained for approximately 100,000 iterations. The models were evaluated on 52 cases from the test set, containing 154 nodules approved by all radiologists.
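The grouping described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the record layout and the field name `confirmed_by_second_reader` are hypothetical, and the toy records stand in for the real annotations.

```python
def split_by_agreement(nodules):
    """Partition training nodules by reader agreement.

    Group A: nodules marked by the initial reader and confirmed by the second.
    Group B: nodules marked only by the initial reader.
    The key 'confirmed_by_second_reader' is a hypothetical field name.
    """
    group_a = [n for n in nodules if n["confirmed_by_second_reader"]]
    group_b = [n for n in nodules if not n["confirmed_by_second_reader"]]
    return group_a, group_b


def build_training_sets(group_a, group_b):
    """Return the three training configurations compared in the study."""
    return {
        "A": list(group_a),
        "A+B": list(group_a) + list(group_b),
        "B": list(group_b),
    }


# Toy records (illustrative only, not study data).
toy_nodules = [
    {"case_id": "c1", "confirmed_by_second_reader": True},
    {"case_id": "c1", "confirmed_by_second_reader": False},
    {"case_id": "c2", "confirmed_by_second_reader": True},
]
group_a, group_b = split_by_agreement(toy_nodules)
training_sets = build_training_sets(group_a, group_b)
```

With the counts reported here, Group A would hold 376 nodules, Group B 248, and the combined set 624.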
Results:
Model performance was analyzed using the free-response receiver operating characteristic (FROC) curve. The model trained with Group A achieved an average sensitivity of 0.880 across operating points of 1, 2, 4, 8, and 16 false positives per scan. The model trained with Groups A and B showed an average sensitivity of 0.803, while the model trained with Group B, whose nodules were approved by only one radiologist, had the lowest average sensitivity of 0.729.
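One common way to compute an average sensitivity of this kind is sketched below. It is an illustration under stated assumptions, not the authors' evaluation code: each detection is assumed to be already labeled as a true or false positive, each true positive is assumed to match a distinct nodule, and the function name and input layout are hypothetical.

```python
def froc_average_sensitivity(detections, n_nodules, n_scans,
                             fp_rates=(1, 2, 4, 8, 16)):
    """Average sensitivity over fixed false-positive-per-scan operating points.

    detections: iterable of (score, is_true_positive) pairs pooled over all
    scans. Assumes duplicate hits on the same nodule were merged beforehand.
    Returns (mean sensitivity, list of per-operating-point sensitivities).
    """
    # Sweep the confidence threshold from the highest score downward.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    fp = 0
    best = [0.0] * len(fp_rates)
    for _score, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Record the highest sensitivity reachable within each FP budget.
        for i, rate in enumerate(fp_rates):
            if fp / n_scans <= rate:
                best[i] = max(best[i], tp / n_nodules)
    return sum(best) / len(best), best


# Toy example (illustrative only): one scan, two nodules, four detections.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
avg, per_point = froc_average_sensitivity(toy, n_nodules=2, n_scans=1,
                                          fp_rates=(0.5, 1, 2))
```

In the toy example, only the top detection fits within 0.5 false positives per scan, so sensitivity there is 0.5, while both nodules are found within the looser budgets.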
Conclusions:
This study demonstrates that the performance of CADe systems improves when they are trained with data in which multiple annotators agree on each nodule. Conversely, an increase in ambiguously annotated data leads to decreased performance. The results indicate that using reliable data verified by multiple experts enhances CADe system effectiveness, which may make lung nodule screening more accurate and efficient in clinical settings.
Clinical Relevance/Application:
Advancements in CADe development are important for early lung cancer screening through improved nodule detection. The findings of this study enable more accurate identification of nodules, thereby enhancing early lung cancer detection.