Learning with Data Sampling Biases for Natural Language Understanding

Anonymous

17 Apr 2022 (modified: 05 May 2023) · ACL ARR 2022 April Blind Submission · Readers: Everyone
Abstract: In recent years, NLP models have improved dramatically by utilizing user data, enabling commercial products such as chatbots and smart voice agents. However, data collected for training such models may suffer from sampling biases, conditioned on the dataset collection protocol. Additionally, a practitioner may not always obtain datasets of the desired volume, particularly given emerging privacy considerations (e.g., relying on users to donate their data for model-building purposes). In this paper, we simulate various scenarios under which one may obtain biased training datasets for the task at hand. We build baselines simulating various biased data sampling conditions and present observations, such as that biased data collection which obtains data points far from class centroids offers more value. We also test two sets of data augmentation algorithms: (i) pseudo-labeling data through semi-supervised learning, assuming the availability of unlabeled data, and (ii) data augmentation through synthetic data generation. We observe that while the best-performing data augmentation method depends on the biased sampling setting and the dataset, simple data augmentation algorithms (such as Easy Data Augmentation) remain largely effective.
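To make the Easy Data Augmentation (EDA) family of techniques mentioned in the abstract concrete, the sketch below shows two of its resource-free token-level operations (random swap and random deletion). This is a minimal illustration, not the paper's implementation; the function names, operation mix, and hyperparameters (n_swaps, p, n_aug) are assumptions chosen for readability.

```python
import random

def random_swap(tokens, n_swaps=1):
    """Randomly swap the positions of two tokens, n_swaps times (an EDA-style operation)."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token (an EDA-style operation)."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def augment(sentence, n_aug=4):
    """Produce n_aug augmented variants by applying one randomly chosen operation each time."""
    tokens = sentence.split()
    ops = [random_swap, random_deletion]
    return [" ".join(random.choice(ops)(tokens)) for _ in range(n_aug)]

if __name__ == "__main__":
    # Hypothetical utterance from an intent-classification setting.
    print(augment("turn on the living room lights"))
```

In a biased-sampling setup like the one described above, such augmented variants would simply be appended to the (biased) labeled training set before fine-tuning the classifier.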
Paper Type: long
