Keywords: performance estimation, label-free, postmarket surveillance
Abstract: Performance monitoring is essential for safe clinical deployment
of image classification models. However, because ground-truth labels
are typically unavailable in the target dataset, direct assessment of
real-world model performance is infeasible. State-of-the-art performance
estimation methods address this by leveraging confidence scores to estimate
the target accuracy. Although this is a promising direction, established
methods mainly estimate only the model’s accuracy and are rarely
evaluated in clinical domains, where strong class imbalance and dataset
shift are common. Our contributions are twofold. First, we introduce
generalisations of existing performance prediction methods that directly
estimate the full confusion matrix. Second, we benchmark their performance
on chest X-ray data under real-world distribution shifts as well as
simulated covariate and prevalence shifts. The proposed confusion matrix
estimation methods reliably predicted clinically relevant counting metrics
on medical images under distribution shifts. However, our simulated
shift scenarios exposed important failure modes of current performance
estimation techniques, calling for a better understanding of real-world
deployment contexts when implementing these performance monitoring
techniques for postmarket surveillance of medical AI models.
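
To make the idea of confidence-based confusion matrix estimation concrete, the following is a minimal sketch and not necessarily the method proposed in this work. It assumes the model outputs calibrated class probabilities and treats each probability vector as a soft vote over the unknown true class, accumulated in the column of the predicted class. The function names `estimate_confusion_matrix` and `estimated_accuracy` are hypothetical and chosen for illustration only.

```python
import numpy as np


def estimate_confusion_matrix(probs):
    """Label-free confusion matrix estimate from predicted probabilities.

    Assumption (illustrative, not taken from the source): probabilities are
    calibrated, so each row of `probs` approximates the distribution of the
    true class. Rows of the result index the estimated true class, columns
    index the predicted class.
    """
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[1]
    preds = probs.argmax(axis=1)          # hard predictions
    one_hot = np.eye(n_classes)[preds]    # shape (n_samples, n_classes)
    # cm[i, j] = sum of p(true = i) over samples predicted as class j
    return probs.T @ one_hot


def estimated_accuracy(cm):
    """Accuracy implied by a (possibly soft) confusion matrix."""
    return np.trace(cm) / cm.sum()


if __name__ == "__main__":
    # Three unlabeled samples, two classes (made-up numbers).
    probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    cm = estimate_confusion_matrix(probs)
    print(cm)
    print(estimated_accuracy(cm))
```

Under these assumptions, the trace of the estimated matrix divided by the number of samples equals the mean maximum confidence, so the sketch reduces to average-confidence accuracy estimation while also exposing off-diagonal entries from which metrics such as sensitivity or precision can be derived. If the probabilities are miscalibrated on the target data, the estimate is biased accordingly.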
Submission Number: 8