Maximum a posteriori estimation and filtering algorithm for numerical label noise

Gaoxia Jiang, Zhengying Li, Wenjian Wang

Published: 01 Jan 2024, Last Modified: 13 Nov 2024. Appl. Intell. 2024. License: CC BY-SA 4.0

Abstract: Data quality, especially label quality, can significantly affect prediction accuracy in supervised learning. Training on datasets with label noise degrades model performance. To address numerical label noise in regression, we estimate the posterior distribution of the true label with a Gaussian mixture model (GMM). We then propose a label noise estimator that combines maximum a posteriori (MAP) estimation with this posterior distribution. In addition, we design a noise filtering algorithm with MAP estimation (MAPNF) by combining the optimal sample selection framework with the estimator. Extensive experiments on benchmark datasets and an age estimation dataset verify the effectiveness of MAPNF. The results on benchmark datasets show that MAPNF outperforms other recent filtering algorithms in improving the generalization performance of different regression models, including both noise-sensitive and noise-robust models. The model error can be reduced by 29.7% to 69.6%. The proposed approach can also identify erroneous labels in an age estimation dataset (total of 18424). The model trained on the filtered dataset (19% of the data removed) reduces the test error on this dataset by at least 2.68%. The results demonstrate a less-is-better effect: lower prediction errors are achieved with fewer, higher-quality samples. We conclude that MAPNF can effectively identify label noise and improve data quality.
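The estimation and filtering steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the posterior of each sample's true label is already available as a one-dimensional Gaussian mixture (weights, means, standard deviations), approximates the MAP estimate by a grid search for the mixture mode, and replaces the optimal sample selection framework with a plain threshold on the estimated noise. The names `mixture_pdf`, `map_estimate`, `filter_by_noise`, and the `threshold` parameter are hypothetical.

```python
import math


def mixture_pdf(y, weights, means, stds):
    """Density of a 1-D Gaussian mixture evaluated at y."""
    return sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def map_estimate(weights, means, stds, grid_size=2001):
    """Approximate the posterior mode (MAP estimate) by a dense grid search.

    Assumption: the posterior over the true label is a 1-D Gaussian mixture.
    The grid spans three standard deviations around the outermost components.
    """
    lo = min(m - 3.0 * s for m, s in zip(means, stds))
    hi = max(m + 3.0 * s for m, s in zip(means, stds))
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    return max(grid, key=lambda y: mixture_pdf(y, weights, means, stds))


def filter_by_noise(labels, posteriors, threshold):
    """Return indices of samples whose estimated label noise is small.

    Estimated noise = observed label - MAP estimate of the true label.
    The fixed threshold is a hypothetical stand-in for the optimal
    sample selection framework mentioned in the abstract.
    """
    kept = []
    for i, (y_obs, (w, m, s)) in enumerate(zip(labels, posteriors)):
        noise = y_obs - map_estimate(w, m, s)
        if abs(noise) <= threshold:
            kept.append(i)
    return kept
```

For example, `filter_by_noise(labels, posteriors, threshold=1.0)` keeps only the samples whose observed label lies within one unit of the posterior mode. In practice the threshold would be chosen by a selection criterion, not fixed by hand.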