On Revisiting Entropy for Identifying Mislabeled Images

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | Everyone | CC BY 4.0
Abstract: Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets (a domain particularly susceptible to labeling errors due to diagnostic complexity) spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.
Lay Summary: Mislabeled samples in training data can seriously reduce the reliability of deep learning models. This is especially challenging because models can eventually memorize incorrect labels rather than ignore them. This paper proposes a method for detecting mislabeled samples by examining how a model learns over time. The key idea is that correctly labeled samples usually become less uncertain as training progresses, while mislabeled samples tend to remain relatively uncertain for much longer. Based on this observation, we introduce a score called the signed entropy integral, or SEI. SEI summarizes both the overall level of uncertainty and how that uncertainty changes during training. The method can be applied to general classification models and works particularly well with CLIP-based models. Experiments on five datasets show that SEI identifies mislabeled samples more accurately than existing methods, while remaining simple to implement and computationally efficient.
Primary Area: Applications->Health / Medicine
Keywords: mislabeled data detection, signed entropy integral, contrastive language-image pretraining (CLIP)
Submission Number: 14123
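
Illustrative sketch: the abstract describes tracking prediction entropy across training epochs and summarizing both its magnitude and its trend. The exact SEI formula is not given on this page, so the code below does not implement SEI. It is a minimal sketch, assuming per-epoch softmax outputs are stored in an array of shape (epochs, samples, classes), that computes two separate ingredients: the area under each sample's entropy curve (magnitude) and the least-squares slope of that curve (trend). The function names `prediction_entropy` and `entropy_summary` and the toy data are hypothetical.

```python
import numpy as np


def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each probability vector on the last axis."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)


def entropy_summary(probs_per_epoch):
    """Summarize each sample's entropy trajectory (hypothetical helper, not SEI).

    probs_per_epoch: array of shape (epochs, samples, classes).
    Returns (area, slope), each of shape (samples,).
    """
    entropy = prediction_entropy(probs_per_epoch)  # shape (epochs, samples)
    epochs = np.arange(entropy.shape[0], dtype=float)
    # Magnitude: trapezoid-rule area under the entropy curve, unit epoch spacing.
    area = np.sum((entropy[1:] + entropy[:-1]) / 2.0, axis=0)
    # Trend: slope of a straight-line fit of entropy against epoch index.
    slope = np.polyfit(epochs, entropy, deg=1)[0]
    return area, slope


# Toy data (assumed, not from the paper): 3 epochs, 2 samples, 2 classes.
# Sample 0 grows confident over training; sample 1 stays uncertain.
toy = np.array([
    [[0.50, 0.50], [0.50, 0.50]],
    [[0.80, 0.20], [0.55, 0.45]],
    [[0.99, 0.01], [0.50, 0.50]],
])
area, slope = entropy_summary(toy)
```

On this toy input, the sample that stays uncertain has the larger area and a near-zero slope, while the sample that becomes confident has a clearly negative slope, which matches the qualitative behavior the abstract attributes to mislabeled and correctly labeled samples respectively.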