Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation
Abstract: Generative large language models (LLMs) are increasingly used for data augmentation, where text samples are paraphrased (or generated anew) and then used for downstream model fine-tuning. This is especially useful in low-resource settings. To improve the augmentations, LLMs are prompted with examples (few-shot scenarios). Yet, these samples are mostly selected randomly, and a comprehensive overview of the effects of other (more "informed") sample selection strategies is lacking. In this work, we compare sample selection strategies from the few-shot learning literature and investigate their effects on LLM-based textual augmentation in a low-resource setting. We evaluate both in-distribution and out-of-distribution downstream model performance. Results indicate that while some "informed" selection strategies do increase model performance, especially on out-of-distribution data, such gains are infrequent and marginal. Unless further advances are made, random sample selection remains a good default for augmentation practitioners.
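To make the compared setups concrete, the sketch below contrasts the random baseline with one representative "informed" strategy (TF-IDF cosine similarity to the sample being paraphrased). It is an illustration under assumptions, not the paper's exact pipeline: the function names, the prompt template, and the choice of TF-IDF as the similarity measure are all hypothetical, and the actual LLM call is omitted.

```python
# Minimal sketch: random vs. similarity-based few-shot selection for
# LLM paraphrasing-based augmentation. Illustrative only; not the
# paper's method. Requires scikit-learn.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_random(pool, k, seed=0):
    # Baseline: pick k demonstration samples uniformly at random.
    return random.Random(seed).sample(pool, k)


def select_similar(pool, query, k):
    # "Informed" strategy (one of several in the few-shot literature):
    # pick the k pool samples most similar to the sample being
    # paraphrased, measured here by TF-IDF cosine similarity.
    vectorizer = TfidfVectorizer().fit(pool + [query])
    sims = cosine_similarity(
        vectorizer.transform([query]), vectorizer.transform(pool)
    )[0]
    return [pool[i] for i in sims.argsort()[::-1][:k]]


def build_prompt(demonstrations, query):
    # Hypothetical few-shot paraphrasing prompt; the demonstrations
    # show the LLM the style and domain of the target dataset.
    shots = "\n".join(f"- {d}" for d in demonstrations)
    return (
        "Here are example sentences from the dataset:\n"
        f"{shots}\n\n"
        f"Paraphrase the following sentence:\n{query}\n"
    )


if __name__ == "__main__":
    pool = [
        "The movie was a delight from start to finish.",
        "Service at the restaurant was painfully slow.",
        "I found the plot confusing and the pacing uneven.",
        "An absolute masterpiece of modern cinema.",
    ]
    query = "The film dragged and the story made little sense."
    print(build_prompt(select_random(pool, k=2), query))
    print(build_prompt(select_similar(pool, query, k=2), query))
```

Swapping the selection function while holding the prompt template and downstream fine-tuning fixed is what allows the strategies to be compared on equal footing.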
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data augmentation, analysis
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 2490