Abstract: Numerical label noise in regression misguides model training and degrades generalization performance. Noise filtering, a popular remedy, reduces the noise level by removing mislabeled samples. However, some filters focus so heavily on lowering the noise level that they also discard clean samples. The existing optimal sample selection (OSS) framework balances the number of removals against the noise level to avoid overcleaning, but its interpretability is obscured by a complicated objective function, and inaccurate parameter estimates or settings may weaken its filtering effect. To address these issues, we first propose a novel interpretable sample selection (ISS) framework against numerical label noise, which maximizes the number of retained samples while keeping the noise level relatively low. ISS converges to OSS at a rate of \(\mathcal{O}(1/\log n)\), so it inherits the good generalization performance of OSS on large-scale datasets. We also prove its adaptability, ensuring that ISS remains effective in changing noise environments. Second, we propose a robust, low-deviation noise estimator, the embedded covering distance. Finally, we present an embedded covering distance filtering (ECDF) algorithm within the ISS framework. Experimental results on benchmark and real-world datasets show that the proposed ECDF algorithm outperforms state-of-the-art filtering approaches against numerical label noise.
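To make the selection principle concrete, below is a minimal Python sketch of an ISS-style rule under stated assumptions. The per-sample noise scores (here drawn synthetically, standing in for estimates such as the paper's embedded covering distance) and the noise-level budget `tau` are hypothetical illustrations, not the paper's actual formulation: the sketch simply keeps the largest set of samples, taken in order of increasing noise score, whose mean score stays within the budget.

```python
import numpy as np

def select_samples(noise_scores, tau=0.2):
    """Keep as many samples as possible while the mean noise score
    of the kept set stays at or below the tolerance `tau`.

    `noise_scores` and `tau` are hypothetical stand-ins for per-sample
    noise estimates and a noise-level budget; they are illustrative only.
    """
    order = np.argsort(noise_scores)            # cleanest samples first
    sorted_scores = noise_scores[order]
    # Running mean of the k cleanest samples, for k = 1..n.
    running_mean = np.cumsum(sorted_scores) / np.arange(1, len(sorted_scores) + 1)
    # The running mean is nondecreasing, so the largest admissible
    # prefix is simply the count of prefixes within the budget.
    k = int(np.sum(running_mean <= tau))
    keep = np.zeros(len(noise_scores), dtype=bool)
    keep[order[:k]] = True
    return keep

# Toy usage with synthetic noise estimates.
rng = np.random.default_rng(0)
scores = rng.exponential(0.1, size=1000)
mask = select_samples(scores, tau=0.15)
print(f"kept {mask.sum()} of {len(scores)} samples")
```

The greedy prefix rule reflects the trade-off the abstract describes: retaining more samples is always preferred, but only as long as the noise level of the retained set stays acceptably low.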