Abstract: Extracting noisy or incorrectly labeled samples from a
labeled dataset that also contains hard (difficult) samples is an important yet under-explored topic. Current methods generally focus either on noisy labels or on hard samples, but
not jointly on both. When the two types of data are
both present, these methods often fail to distinguish them,
which results in a decline in the overall performance of
the model. We propose a systematic empirical study that
provides insights into the similarities and, more importantly,
the differences between hard and noisy samples. The
method consists of designing synthetic datasets in which
different samples are assigned different hardness and
noisiness levels. These controlled experiments pave the way for
the evaluation and development of methods that distinguish
between hard and noisy samples. We evaluate how various
data-partitioning methods are able to remove noisy samples
while retaining hard samples. Our study highlights the
advantages of a data-partitioning metric that we
propose and call the static centroid distance. The resulting method outperforms others: it leads to high
test accuracy for models trained on the filtered datasets,
as shown both for datasets with synthetic label noise and
for datasets with real-world label noise. It also significantly
outperforms other methods when employed within a semi-supervised learning framework.
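To make the idea concrete, here is an illustrative sketch (not the paper's exact formulation) of a static centroid distance: given feature embeddings computed once from a fixed encoder (hence "static"), each sample is scored by its distance to the centroid of the samples sharing its label, so that samples far from their class centroid are flagged as likely mislabeled. All names below (static_centroid_distance, features, labels) are hypothetical.

    import numpy as np

    def static_centroid_distance(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
        """Distance of each sample's fixed embedding to its class centroid.

        features: (n_samples, dim) embeddings, computed once ("static").
        labels:   (n_samples,) integer class labels (possibly noisy).
        Returns a per-sample distance; larger values suggest label noise.
        """
        distances = np.empty(len(features), dtype=float)
        for c in np.unique(labels):
            mask = labels == c
            centroid = features[mask].mean(axis=0)  # centroid of class c
            distances[mask] = np.linalg.norm(features[mask] - centroid, axis=1)
        return distances

    # Hypothetical usage: drop the highest-distance (likely noisy) samples
    # while keeping hard but correctly labeled ones below the threshold.
    # scores = static_centroid_distance(features, noisy_labels)
    # keep = scores < threshold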