GB: Combating Textual Label Noise by Granular-ball based Robust Training

Published: 01 Jan 2024 · Last Modified: 26 Aug 2024 · ICMR 2024 · CC BY-SA 4.0
Abstract: Most natural language processing tasks rely on massive labeled data to train high-performing neural network models. However, label noise (i.e., wrong labels) is inevitably introduced when annotating large-scale text datasets, which significantly degrades model performance. To overcome this dilemma, we propose a novel <u>G</u>ranular-<u>B</u>all based t<u>RAIN</u>ing framework, named GBRAIN, which realizes robust coarse-grained representation learning and thus combats label noise across diverse text tasks. Specifically, exploiting the fact that most samples in a dataset are correctly labeled, GBRAIN first introduces a dynamic granular-ball clustering algorithm that blends seamlessly into a conventional neural network model. A striking feature of this clustering algorithm is that it adaptively groups the embedding vectors of similar samples into the same set (hereafter referred to as a granular-ball). The embedding vectors and labels of all samples in the same set are then coarsely represented by the center vector and the label of the granular-ball, respectively. Consequently, noisy labels can be rectified by the labels of the correctly labeled majority. Moreover, we introduce a new gradient backpropagation mechanism compatible with our framework, which helps optimize the coarse-grained embedding vectors over iterative training. Empirical results on text classification and named entity recognition tasks demonstrate that GBRAIN is effective compared with state-of-the-art baselines.
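The coarse-graining idea described above — grouping similar embedding vectors into a ball and rectifying each member's label to the ball's majority label — can be sketched as follows. This is a minimal illustrative sketch only: a plain k-means pass stands in for GBRAIN's dynamic granular-ball clustering, and all function and variable names here are hypothetical, not from the paper.

```python
import numpy as np
from collections import Counter

def granular_ball_correct(embeddings, labels, n_balls=2, n_iter=10):
    """Sketch of granular-ball style label rectification.

    A simple k-means pass (NOT the paper's dynamic clustering) groups
    similar embeddings into "balls"; each ball is then coarsely
    represented by its center vector, and every member's label is
    rectified to the ball's majority label.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Greedy farthest-point initialization of ball centers.
    idx = [0]
    for _ in range(n_balls - 1):
        d = np.min(
            np.linalg.norm(embeddings[:, None] - embeddings[idx][None], axis=2),
            axis=1,
        )
        idx.append(int(d.argmax()))
    centers = embeddings[idx].copy()

    # Lloyd iterations: assign each sample to its nearest ball center,
    # then recompute each center as the mean of its members.
    for _ in range(n_iter):
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_balls):
            if (assign == k).any():
                centers[k] = embeddings[assign == k].mean(axis=0)

    # Coarse-grain: replace each sample's embedding with its ball's
    # center vector and its label with the ball's majority label.
    coarse = embeddings.copy()
    corrected = labels.copy()
    for k in range(n_balls):
        mask = assign == k
        if mask.any():
            majority = Counter(labels[mask].tolist()).most_common(1)[0][0]
            coarse[mask] = centers[k]
            corrected[mask] = majority
    return coarse, corrected
```

With two well-separated clusters and one mislabeled sample, the noisy label is outvoted by its ball's majority, which is the intuition behind rectifying noise through the (mostly correct) labeled majority.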