FastSimiFeat: A Fast and Generalized Approach Utilizing k-NN for Noisy Data Handling

Published: 01 Jan 2024, Last Modified: 06 Feb 2025
Venue: CIKM 2024
License: CC BY-SA 4.0
Abstract: As deep learning technologies continue to evolve, training neural networks on noisy data becomes an increasingly critical challenge. Incorrect labels lead models to memorize erroneous examples, a phenomenon known as the memorization effect, which significantly undermines both performance and the ability to generalize. Traditional methods for handling noisy labels typically require extensive modifications to the training procedure, leading to prolonged refinement processes. Although some recent approaches eliminate the need for retraining by using pre-trained models, they still suffer from lengthy refinement times and inaccurate noise ratio estimates. In response, we introduce FastSimiFeat, a novel algorithm that efficiently applies the k-nearest neighbors (k-NN) technique to feature vectors derived from pre-trained models. This training-free method incorporates a new confusion matrix-based noise ratio estimator that significantly reduces the need for iterative refinement by adapting the number of k-NN cycles to the detected noise level. Additionally, we propose an innovative label correction method that leverages potentially noisy data to enhance model robustness and generality. Our extensive evaluations on both synthetic and real-world datasets demonstrate that FastSimiFeat not only minimizes refinement time but also consistently outperforms existing methods in terms of accuracy. These results confirm the suitability of FastSimiFeat for industrial applications where reliable data processing is paramount. By leveraging the inherent features of neural networks pre-trained on large datasets, FastSimiFeat sets a new standard for minimal-dependency approaches in noisy data environments, facilitating the deployment of more reliable and efficient deep learning models across various sectors.
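
The general idea described in the abstract (k-NN voting over feature vectors, a confusion-matrix-style disagreement estimate, and a cycle count that depends on the detected noise) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stopping tolerance `tol`, the off-diagonal noise estimate, and the toy 2-D points standing in for pre-trained features are all assumptions made for illustration only.

```python
import math
from collections import Counter


def knn_vote(features, labels, k):
    """For each sample, return the majority label of its k nearest neighbors (self excluded)."""
    voted = []
    for i, x in enumerate(features):
        # Sort all other samples by Euclidean distance to sample i.
        dists = sorted((math.dist(x, y), j) for j, y in enumerate(features) if j != i)
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        voted.append(Counter(neighbor_labels).most_common(1)[0][0])
    return voted


def confusion_counts(given, voted):
    """Count (given label, k-NN voted label) pairs: a sparse confusion matrix."""
    return Counter(zip(given, voted))


def noise_ratio(confusion):
    """Assumed estimator: off-diagonal mass of the confusion matrix."""
    total = sum(confusion.values())
    off_diagonal = sum(c for (g, v), c in confusion.items() if g != v)
    return off_diagonal / total


def refine_labels(features, labels, k=7, max_cycles=5, tol=0.01):
    """Repeat k-NN relabeling until the estimated noise ratio falls to tol (hypothetical rule)."""
    current = list(labels)
    history = []
    for _ in range(max_cycles):
        voted = knn_vote(features, current, k)
        ratio = noise_ratio(confusion_counts(current, voted))
        history.append(ratio)
        if ratio <= tol:
            break  # little detected noise: no further cycles needed
        current = voted
    return current, history


# Toy stand-in for pre-trained features: two tight, well-separated 5x5 grids.
features = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
features += [(10 + 0.1 * i, 10 + 0.1 * j) for i in range(5) for j in range(5)]
true_labels = [0] * 25 + [1] * 25
noisy_labels = list(true_labels)
for idx in (0, 7, 19, 25, 33, 48):  # flip three labels per class
    noisy_labels[idx] = 1 - noisy_labels[idx]

cleaned, history = refine_labels(features, noisy_labels)
```

In this toy setting, at most three of any sample's seven neighbors carry a flipped label, so a single voting cycle restores the original labels and the second cycle only confirms that no disagreement remains. Real feature spaces are far less cleanly separated, which is where a more careful noise estimator and correction rule, such as those the abstract describes, would matter.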