Masking the Unknown: Leveraging Masked Samples for Enhanced Data Augmentation

Published: 26 Apr 2024, Last Modified: 15 Jul 2024
Venue: UAI 2024 poster
License: CC BY 4.0
Keywords: Data augmentation, Masked Language Model, Text classification, Data variance
Abstract: Data Augmentation (DA) has become a widely adopted strategy for addressing data scarcity in numerous NLP tasks, especially in scenarios with limited resources or imbalanced classes. However, many existing augmentation techniques rely on randomness or additional resources, which poses challenges for both performance and practical implementation. Furthermore, what constitutes effective augmentation remains underexplored. In this paper, we systematically evaluate existing DA methods across a comprehensive range of text-classification benchmarks. The empirical analysis shows that the most significant change resulting from augmentation is in the data variance. This observation inspires the proposed approach, termed Mask-for-Data Augmentation (M4DA), which strategically masks tokens from original samples for augmentation. Specifically, M4DA consists of a Variance-Oriented Masker Module (VMM), which ensures an increase in data variance, and a Complexity-Enhanced Selection Module (CSM), designed to select the augmented sample with the highest semantic complexity. The effectiveness of the proposed method is empirically validated across various text-classification benchmarks, including scenarios with limited or full resources and imbalanced classes. Experimental results demonstrate considerable improvements over state-of-the-art methods.
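The two-stage idea in the abstract (mask tokens, keep candidates that raise data variance, then pick the most complex one) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how variance or semantic complexity are measured, so the bag-of-words variance in `total_variance`, the distinct-token score in `complexity_proxy`, and every function name and parameter below are assumptions chosen only to make the pipeline concrete.

```python
import random
from statistics import pvariance

MASK = "[MASK]"

def featurize(tokens, vocab):
    # Assumption: a bag-of-words count vector stands in for a learned embedding.
    return [tokens.count(word) for word in vocab]

def total_variance(vectors):
    # Sum of per-dimension population variances over all feature vectors.
    return sum(pvariance(dim) for dim in zip(*vectors))

def mask_candidates(tokens, n_candidates, mask_ratio, rng):
    # Produce several copies of the sample, each with a random subset masked.
    k = max(1, round(mask_ratio * len(tokens)))
    candidates = []
    for _ in range(n_candidates):
        masked = set(rng.sample(range(len(tokens)), k))
        candidates.append([MASK if i in masked else t for i, t in enumerate(tokens)])
    return candidates

def complexity_proxy(tokens):
    # Hypothetical stand-in for "semantic complexity": distinct unmasked tokens.
    return len({t for t in tokens if t != MASK})

def augment(sample, dataset, n_candidates=8, mask_ratio=0.2, seed=0):
    rng = random.Random(seed)
    vocab = sorted({t for s in dataset for t in s} | {MASK})
    base = [featurize(s, vocab) for s in dataset]
    base_var = total_variance(base)
    # Variance-oriented filter: keep candidates that increase total variance.
    kept = [
        c for c in mask_candidates(sample, n_candidates, mask_ratio, rng)
        if total_variance(base + [featurize(c, vocab)]) > base_var
    ]
    # Complexity-oriented selection: pick the highest-scoring survivor, if any.
    return max(kept, key=complexity_proxy) if kept else None
```

Calling `augment(dataset[0], dataset)` on a list of tokenized sentences returns one masked copy of the first sentence, or `None` when no candidate raises the variance. A faithful version would replace both proxies with the measures defined in the paper itself.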
List Of Authors: Yao, Xun and Huang, Zijian and Hu, Xinrong and Yang, Jie and Guo, Yi
Submission Number: 596