How Effective and Robust is Sentence-Level Data Augmentation for Named Entity Recognition?

Runmin Jiang, Xin Zhang, Jiyue Jiang, Wei Li, Yuhao Wang

Published: 2022, Last Modified: 05 Jul 2023, Venue: NLPCC (1) 2022, Readers: Everyone

Abstract: Data augmentation is a simple but effective way to improve the effectiveness and robustness of pre-trained models. However, it is difficult to adapt to token-level tasks such as named entity recognition (NER), because these tasks operate at a different semantic granularity and use more fine-grained labels. Inspired by mixup augmentations in computer vision, we propose three sentence-level data augmentations, CMix, CombiMix, and TextMosaic, and adapt them to the NER task. Through empirical experiments on three authoritative datasets (OntoNotes4, CoNLL-03, OntoNotes5), we find that these methods improve the effectiveness of the models when the number of augmented samples is controlled. Strikingly, the results show that our approaches greatly improve the robustness of the pre-trained model, even compared with strong baselines and token-level data augmentations. We achieved state-of-the-art (SOTA) results in the robustness evaluation of the CCIR CUP 2021. The code is available at https://github.com/jrmjrm01/SenDA4NER-NLPCC2022 .
