Abstract: Following the "garbage in, garbage out" maxim, the quality of training data supplied to machine learning models directly impacts their performance. Generating high-quality annotated training sets from unlabelled data is both expensive and unreliable. Moreover, social media platforms are increasingly limiting academic access to data, eliminating a key resource for NLP research. Consequently, researchers are shifting focus towards text data augmentation strategies to overcome these restrictions. In this work, we present an innovative data augmentation method, PromptAug, focusing on the design of distinct prompt engineering techniques for Large Language Models (LLMs). We concentrate on Instruction, Context, Example, and Definition prompt attributes, empowering LLMs to generate high-quality, class-specific data instances without requiring pre-training. We demonstrate the effectiveness of PromptAug, with improvements over the baseline dataset of 2% in accuracy, 5% in F1-score, 5% in recall, and 2% in precision. Furthermore, we evaluate PromptAug over a variety of dataset sizes, confirming its effectiveness even in extreme data-scarcity scenarios. To ensure a thorough evaluation of data augmentation methods, we further perform a qualitative thematic analysis, identifying four problematic themes in augmented text data: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation.
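The abstract names four prompt attributes (Instruction, Context, Example, Definition) that are composed into a single LLM prompt. The sketch below is a minimal illustration of how such a template might be assembled; the attribute names follow the abstract, but the function, wording, and sample values are hypothetical and are not the authors' actual prompts.

```python
# Hypothetical sketch of a PromptAug-style prompt template.
# The four attribute names come from the abstract; everything else
# (function name, field wording, sample text) is illustrative only.

def build_prompt(instruction: str, context: str, example: str,
                 definition: str, label: str) -> str:
    """Assemble the four prompt attributes into one prompt string
    that asks an LLM for a new, class-specific data instance."""
    return "\n".join([
        f"Instruction: {instruction}",
        f"Context: {context}",
        f"Definition of '{label}': {definition}",
        f"Example of '{label}': {example}",
    ])

prompt = build_prompt(
    instruction="Write one new social media post belonging to the class below.",
    context="Posts are short, informal messages from a social media platform.",
    example="Honestly, today was rough, but tomorrow is a new day.",
    definition="Text exhibiting the characteristics of the target class.",
    label="target-class",
)
print(prompt)
```

The assembled string would then be sent to an LLM, and the generated instances added to the training set for the target class.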
Paper Type: long
Research Area: Computational Social Science and Cultural Analytics
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English