A standardized performance evaluation metric for chest CT nodule detection

Doohyun Park, Chanmin Park, Jinyoung Kim

Published: 08 May 2024, Last Modified: 10 Apr 2025, 2024 ESTI, Everyone, CC BY 4.0

Abstract:

Purpose/Objectives: To underscore the problems with using the Free-response Receiver Operating Characteristics (FROC) curve for nodule detection in chest CT, and to suggest a standardized metric for evaluating the performance of the detection model.

Methods & Materials: The LUNA16 dataset [1] is widely recognized as a benchmark for nodule detection in chest CT scans using deep learning models. Models evaluated on the LUNA16 dataset are commonly scored with the competition performance metric (CPM), which averages sensitivities at several predefined numbers of false positives (FPs) per scan. However, when the number of nodules per scan increases, maintaining a consistent positive predictive value (PPV) requires evaluating a greater number of nodule candidates, as shown in Figure 1. This can increase the number of FPs per scan, so a FROC curve with FPs per scan may represent the detection model's performance less accurately. In this study, we propose using a FROC curve with FPs per nodule instead of per scan. To explore this, we analyzed multiple datasets and conducted a simulation study. The simulation generated detection results for 100 scans under two scenarios, with each scan containing two nodules in the first scenario and ten nodules in the second. Both simulated datasets contain 10 false candidates per nodule, each classified as an FP or a true negative according to decision thresholds determined by the number of FPs per scan. We calculated the CPM from the decision thresholds at 0.125, 0.25, 0.5, 1, 2, 4, and 8 FPs, both per scan and per nodule. For statistical analysis, we repeated the simulations 100 times to calculate confidence intervals (CIs) and conducted a paired t-test to analyze differences in CPM between the two datasets.

Results: As shown in Figure 2, the LUNA16 dataset exhibited an average of 1.97 nodules per scan, considering only scans with nodules. In contrast, the three datasets analyzed in this study had averages of 5.33, 6.52, and 9.91 nodules per scan, respectively, which is 2.71 to 5.03 times the LUNA16 average. Indeed, previous studies show that the number of nodules per scan and the range of the FROC curve in chest CT nodule detection vary depending on the dataset [2][3]. As shown in Figure 3, in the simulation study the CPM (95% CI) for datasets 1 and 2 using the FROC curve with FPs per scan was 47.9% (47.4%–48.5%) and 24.8% (24.5%–25.0%), respectively, with p < 0.001. In contrast, for the FROC curve with FPs per nodule, the CPM was 59.7% (59.2%–60.2%) and 60.2% (60.0%–60.4%), respectively, with p = 0.11.

Conclusion: Using the FROC curve with FPs per scan can lead to significant variations in measured model performance depending on the data characteristics. Therefore, employing a FROC curve with FPs per nodule to evaluate model performance could become a more standardized approach.
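
The simulation described above can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not state how candidate scores were generated, so the Gaussian score model, the `separation` value, and the function names `simulate`, `cpm`, and `compare` are all assumptions made for illustration. The only difference between the two metrics is the denominator used to turn an FP rate into an allowed FP count: the number of scans, or the number of nodules.

```python
import numpy as np

# FP operating points used for the CPM, as listed in the abstract.
FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)


def simulate(n_scans, nodules_per_scan, false_per_nodule=10,
             separation=1.5, rng=None):
    """Draw candidate scores for one simulated dataset.

    ASSUMPTION: nodules score ~ N(separation, 1) and false candidates
    ~ N(0, 1). The abstract does not specify the score model.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_nodules = n_scans * nodules_per_scan
    nodule_scores = rng.normal(separation, 1.0, n_nodules)
    false_scores = rng.normal(0.0, 1.0, n_nodules * false_per_nodule)
    return nodule_scores, false_scores


def cpm(nodule_scores, false_scores, denominator, fp_rates=FP_RATES):
    """Average sensitivity over thresholds allowing rate * denominator FPs.

    Pass the number of scans as `denominator` for FPs per scan, or the
    number of nodules for FPs per nodule.
    """
    false_desc = np.sort(false_scores)[::-1]  # highest-scoring false first
    sensitivities = []
    for rate in fp_rates:
        allowed = int(rate * denominator)  # FP budget at this operating point
        if allowed >= false_desc.size:
            threshold = -np.inf  # budget exceeds all false candidates
        else:
            # Scores strictly above this value yield exactly `allowed` FPs.
            threshold = false_desc[allowed]
        sensitivities.append(np.mean(nodule_scores > threshold))
    return float(np.mean(sensitivities))


def compare(n_scans=100, seed=0):
    """Return {nodules_per_scan: (cpm_per_scan, cpm_per_nodule)}."""
    rng = np.random.default_rng(seed)
    results = {}
    for k in (2, 10):  # the two scenarios from the abstract
        nodules, falses = simulate(n_scans, k, rng=rng)
        per_scan = cpm(nodules, falses, denominator=n_scans)
        per_nodule = cpm(nodules, falses, denominator=nodules.size)
        results[k] = (per_scan, per_nodule)
    return results


if __name__ == "__main__":
    for k, (per_scan, per_nodule) in compare().items():
        print(f"{k:>2} nodules/scan: CPM per scan = {per_scan:.3f}, "
              f"CPM per nodule = {per_nodule:.3f}")
```

Under these assumptions, the per-scan CPM drops markedly when each scan holds ten nodules rather than two, even though the detector's score distributions are identical, while the per-nodule CPM stays roughly the same. The exact values depend on the assumed score model and seed and should not be read as a reproduction of the reported results.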