DeepClean - Contrastive Learning Towards Quality Assessment in Large-Scale CXR Data Sets

Sofia Cardoso Pereira; João Pedrosa; Joana Rocha; Pedro Sousa; Aurélio Campilho; Ana Maria Mendonça

Published: 01 Jan 2024, Last Modified: 08 Apr 2025, Venue: BIBM 2024, License: CC BY-SA 4.0

Abstract: Large-scale datasets are essential for training deep learning models in medical imaging. However, many of these datasets contain poor-quality images that can compromise model performance and clinical reliability. In this study, we propose a framework to detect non-compliant images, such as corrupted scans, incomplete thorax X-rays, and images of non-thoracic body parts, by combining contrastive learning for feature extraction with parametric or non-parametric scoring methods for out-of-distribution ranking. Our approach was developed and tested on the CheXpert dataset, achieving an AUC of 0.75 on a manually labeled subset of 1,000 images, and was further validated qualitatively and visually on the external PadChest dataset, where it also performed effectively. Our results demonstrate the potential of contrastive learning to detect non-compliant images in large-scale medical datasets, laying the foundation for future work on reducing dataset pollution and improving the robustness of deep learning models in clinical practice.
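
The scoring stage described in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific scoring methods, so the choices below are assumptions made for illustration only: a Mahalanobis distance stands in for a parametric score and a k-nearest-neighbour distance for a non-parametric one. The embeddings are synthetic Gaussian vectors rather than features from a contrastively trained encoder, and all function names and parameters (`mahalanobis_scores`, `knn_scores`, `auc`, `k`, `eps`) are hypothetical.

```python
import numpy as np


def mahalanobis_scores(ref_feats, query_feats, eps=1e-6):
    """Parametric score (assumed): squared Mahalanobis distance to one Gaussian fit on ref_feats."""
    mean = ref_feats.mean(axis=0)
    # Small ridge term keeps the covariance matrix invertible.
    cov = np.cov(ref_feats, rowvar=False) + eps * np.eye(ref_feats.shape[1])
    precision = np.linalg.inv(cov)
    diff = query_feats - mean
    return np.einsum("ij,jk,ik->i", diff, precision, diff)


def knn_scores(ref_feats, query_feats, k=5):
    """Non-parametric score (assumed): Euclidean distance to the k-th nearest reference embedding."""
    sq_dists = ((query_feats[:, None, :] - ref_feats[None, :, :]) ** 2).sum(axis=-1)
    kth_sq = np.partition(sq_dists, k - 1, axis=1)[:, k - 1]
    return np.sqrt(kth_sq)


def auc(scores, labels):
    """AUC as the probability that a positive (label 1) outranks a negative; ties count as 0.5."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)


# Synthetic stand-ins for encoder embeddings (not real CXR features).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 16))               # assumed compliant reference set
compliant = rng.normal(size=(200, 16))               # held-out compliant images
non_compliant = rng.normal(loc=3.0, size=(20, 16))   # shifted cluster acting as outliers

queries = np.vstack([compliant, non_compliant])
labels = np.concatenate([np.zeros(200, dtype=int), np.ones(20, dtype=int)])

maha_s = mahalanobis_scores(reference, queries)
knn_s = knn_scores(reference, queries, k=5)

# Higher score = more out-of-distribution; sorting descending yields a review ranking.
ranking = np.argsort(-maha_s)
print("Mahalanobis AUC:", auc(maha_s, labels))
print("kNN AUC:", auc(knn_s, labels))
```

In this toy setting the outliers are well separated, so both scores rank them near the top; real non-compliant images are far harder to separate, and the numbers here say nothing about the reported AUC of 0.75.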
