PromptAug: Data Augmentation for Fine Grained Conflict Identification

ACL ARR 2024 June Submission3200 Authors

15 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Following the "garbage in, garbage out" maxim, the quality of training data supplied to machine learning models impacts their performance. Generating high-quality annotated training sets from unlabelled data is both expensive and unreliable. Moreover, social media platforms are increasingly limiting academic access to data, eliminating a key resource for NLP research. Consequently, researchers are shifting focus towards text data augmentation strategies to overcome these restrictions. In this work, we present an innovative data augmentation method, PromptAug, using Large Language Models (LLMs). We demonstrate the effectiveness of PromptAug, with improvements of 2\% in accuracy and 5\% in F1-score over the baseline dataset. Furthermore, we evaluate PromptAug over a variety of dataset sizes, demonstrating its effectiveness even in extreme data scarcity scenarios. To ensure a thorough evaluation of data augmentation methods, we further perform qualitative thematic analysis, identifying four problematic themes in augmented text data: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation.
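To make the abstract's description of LLM-based augmentation concrete, the sketch below shows one generic way such a pipeline can be structured: each labelled example is wrapped in a label-preserving paraphrase prompt and sent to an LLM, and the returned paraphrases are collected as additional training examples. This is a minimal illustration only, not the paper's actual PromptAug prompts or pipeline; the prompt template, the `augment_with_llm` function, and the `generate` callable are all hypothetical stand-ins.

```python
from typing import Callable, List

# Hypothetical prompt template -- not the paper's actual PromptAug prompt.
AUGMENT_PROMPT = (
    "Rewrite the following social media post so that it keeps the same "
    "meaning and the same label ('{label}'), but uses different wording:\n\n"
    "{text}\n\nRewritten post:"
)

def augment_with_llm(
    examples: List[dict],
    generate: Callable[[str], str],
    n_variants: int = 2,
) -> List[dict]:
    """Produce label-preserving paraphrases of each labelled example.

    `generate` is any function that sends a prompt string to an LLM and
    returns the model's text completion (e.g. a thin wrapper around a
    hosted API or a local model).
    """
    augmented = []
    for ex in examples:
        prompt = AUGMENT_PROMPT.format(label=ex["label"], text=ex["text"])
        for _ in range(n_variants):
            new_text = generate(prompt).strip()
            # Keep only non-empty paraphrases that differ from the original.
            if new_text and new_text != ex["text"]:
                augmented.append({"text": new_text, "label": ex["label"]})
    return augmented

# Example usage with a stand-in generator (replace with a real LLM call):
if __name__ == "__main__":
    seed = [{"text": "You always twist my words in every thread.", "label": "conflict"}]
    dummy_generate = lambda prompt: "Every thread, you keep distorting what I say."
    print(augment_with_llm(seed, dummy_generate, n_variants=1))
```

In a setup like this, the augmented examples would typically be concatenated with the original labelled set before fine-tuning a classifier, which is the kind of baseline-versus-augmented comparison the abstract reports.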
Paper Type: Long
Research Area: Generation
Research Area Keywords: human behavior analysis, quantitative analyses of social media, data augmentation, NLP in resource-constrained settings, few shot generation, prompting, generative models, generalization, hate speech detection
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 3200