DAWSON: Data Augmentation using Weak Supervision On Natural Language

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: We propose a novel data augmentation model for text that exploits all available data through weak supervision. To improve generalization, recent work uses BERT and masked language modeling to conditionally augment data. These models rely on a small, high-quality labeled dataset but ignore the abundance of unlabeled data that is usually available whenever such a model is considered in the first place. Weak supervision methods, such as Snorkel, exploit this wealth of unlabeled data but largely ignore the available ground-truth labels. We combine data augmentation and weak supervision into a holistic method, consisting of 4 training phases and 2 inference phases, that efficiently trains an end-to-end model when only a small amount of annotated data is available. We outperform the benchmark (Kumar et al., 2020) on the SST-2 task by 1.5, the QQP task by 4.4, and the QNLI task by 3.0 absolute accuracy points, and show that data augmentation is also effective for natural language understanding tasks such as QQP and QNLI.
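The paper's own pipeline is not reproduced on this page, but the conditional masked-language-model augmentation it builds on (in the style of Kumar et al., 2020) can be sketched roughly as follows. This is a minimal illustrative sketch, assuming a HuggingFace `bert-base-uncased` checkpoint; the function name `augment`, the label-prepending scheme, and the `mask_prob` parameter are assumptions for illustration, not the DAWSON method itself.

```python
# Hypothetical sketch: conditional masked-LM augmentation.
# Prepend the class label to the text, mask some tokens, and let a pretrained
# BERT fill them in, yielding a new example with the same label.
import random
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def augment(text: str, label: str, mask_prob: float = 0.15) -> str:
    """Return one augmented variant of `text`, conditioned on `label`."""
    # Condition the language model by prepending the label token(s).
    conditioned = f"{label} {text}"
    enc = tokenizer(conditioned, return_tensors="pt")
    input_ids = enc["input_ids"].clone()

    # Randomly mask a fraction of the non-special tokens.
    special = set(tokenizer.all_special_ids)
    candidates = [i for i, tok in enumerate(input_ids[0].tolist()) if tok not in special]
    n_mask = max(1, int(mask_prob * len(candidates)))
    for i in random.sample(candidates, n_mask):
        input_ids[0, i] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(input_ids, attention_mask=enc["attention_mask"]).logits

    # Replace each [MASK] with the model's top prediction.
    for i in (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]:
        input_ids[0, i] = logits[0, i].argmax()

    decoded = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    # Strip the prepended label before returning the augmented sentence.
    return decoded[len(label):].strip()

print(augment("the movie was surprisingly good", "positive"))
```

In the weak-supervision half of the method, such augmented and unlabeled examples would be assigned probabilistic labels (e.g., via Snorkel-style labeling functions) rather than discarded, which is what lets the approach use all available data.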