Bounded Loss Robustness: Enhancing the MAE Loss for Large-Scale Noisy Data Learning

15 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: noisy dataset, label noise, noise-robust loss, logit bias, image classification
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: By analyzing the backpropagation error of bounded losses, we show why they struggle to learn many-class datasets and develop a method that enables the Mean Absolute Error to learn datasets with many classes.
Abstract: Large annotated datasets inevitably contain noisy labels, which pose a major challenge for training deep neural networks because such networks readily fit the noise. Noise-robust loss functions have emerged as a notable strategy to counteract this issue, with symmetric losses, a subset of the bounded losses, displaying significant noise robustness. Yet, the class of symmetric loss functions might be too restrictive, with functions such as the Mean Absolute Error (MAE) being susceptible to underfitting. Through a quantitative approach, this paper explores the learning behavior of bounded loss functions, particularly the limited overlap between the network output at initialization and the regions where the loss has a non-zero derivative. We introduce a novel method, "logit bias", which adds a real number, denoted as $\epsilon$, to the logit at the correct class position. This method addresses underfitting by restoring the overlap, enabling MAE to learn even on datasets such as WebVision, which comprises over a million images from 1000 classes. Extensive numerical experiments show that MAE, in combination with our proposed method, can compete with state-of-the-art noise-robust loss functions. Remarkably, our method relies on a single parameter, $\epsilon$, which is determined by the number of classes, resulting in a method with no dataset- or noise-dependent hyperparameters.
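The abstract describes shifting the correct-class logit by a constant $\epsilon$ before computing the MAE loss on the softmax output. Below is a minimal sketch of that idea in PyTorch; the function name `mae_with_logit_bias` and the treatment of $\epsilon$ as a plain argument are illustrative assumptions, since the paper's exact formula relating $\epsilon$ to the number of classes is not given here.

```python
import torch
import torch.nn.functional as F

def mae_with_logit_bias(logits: torch.Tensor, targets: torch.Tensor, epsilon: float) -> torch.Tensor:
    """MAE between softmax probabilities and one-hot labels, with a constant
    `epsilon` added to the logit at the labelled-class position ("logit bias").

    logits:  (batch, num_classes) raw network outputs
    targets: (batch,) integer class labels
    """
    num_classes = logits.size(1)
    # Shift only the logit of the labelled class by epsilon.
    biased = logits + epsilon * F.one_hot(targets, num_classes).to(logits.dtype)
    probs = F.softmax(biased, dim=1)
    p_correct = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    # For one-hot labels, sum_k |p_k - y_k| simplifies to 2 * (1 - p_y).
    return (2.0 * (1.0 - p_correct)).mean()
```

The bias enlarges the correct-class probability at initialization, moving the network output into a region where the bounded MAE loss still has a non-vanishing derivative, which is the underfitting mechanism the abstract refers to.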
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 101