Abstract: Extracting noisy or incorrectly labeled samples from a
labeled dataset that also contains hard (difficult) samples is an important yet under-explored topic. Current methods generally focus either on noisy labels or on hard samples, but
not jointly on both. When the two types of data are
both present, these methods often fail to distinguish them,
which results in a decline in the overall performance of
the model. We propose a systematic empirical study that
provides insights into the similarities and, more importantly,
the differences between hard and noisy samples. The
method consists of designing synthetic datasets in which
different samples are assigned different hardness and
noisiness levels. These controlled experiments pave the way for
the evaluation and development of methods that distinguish
between hard and noisy samples. We evaluate how various
data-partitioning methods are able to remove noisy samples
while retaining hard samples. Our study highlights the
advantages of a data-partitioning metric that we
propose and call the static centroid distance. The resulting method outperforms others: it leads to high
test accuracy for models trained on the filtered datasets,
as shown both for datasets with synthetic label noise and
for datasets with real-world label noise. It also significantly
outperforms other methods when employed within a semi-supervised learning framework.
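To make the idea concrete, here is an illustrative sketch (not the paper's exact formulation) of a static centroid distance: given feature embeddings computed once from a fixed encoder (hence "static"), each sample is scored by its distance to the centroid of the samples sharing its label, so that samples far from their class centroid are flagged as likely mislabeled. All names below (static_centroid_distance, features, labels) are hypothetical.

    import numpy as np

    def static_centroid_distance(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
        """Distance of each sample's fixed embedding to its class centroid.

        features: (n_samples, dim) embeddings, computed once ("static").
        labels:   (n_samples,) integer class labels (possibly noisy).
        Returns a per-sample distance; larger values suggest label noise.
        """
        distances = np.empty(len(features), dtype=float)
        for c in np.unique(labels):
            mask = labels == c
            centroid = features[mask].mean(axis=0)  # centroid of class c
            distances[mask] = np.linalg.norm(features[mask] - centroid, axis=1)
        return distances

    # Hypothetical usage: drop the highest-distance (likely noisy) samples
    # while keeping hard but correctly labeled ones below the threshold.
    # scores = static_centroid_distance(features, noisy_labels)
    # keep = scores < threshold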