Regulating the level of manipulation in text augmentation with systematic adjustment and advanced sentence-embedding

22 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Text augmentation, the level of manipulation, advanced sentence-embedding, reliable pseudo-labels
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This research emphasizes the importance of text augmentation and proposes a solution for the "level of manipulation" issue. It introduces a method that ensures diversity while providing reliable pseudo-labeling through advanced sentence embedding.
Abstract: Text augmentation, a method for generating samples by applying combinations, noise, and other manipulations to a small dataset, is a crucial technique in natural language processing (NLP) research. It introduces diversity into the training process, thereby enabling the construction of robust models. The level of manipulation is the most important issue in text augmentation: low-level manipulation generates data similar to the original, which makes augmentation inefficient because it cannot ensure diversity, whereas high-level manipulation makes labels unreliable and degrades the model's performance. Therefore, we propose a systematically adjustable text augmentation technique to address the "level of manipulation" issue. Specifically, we propose a method for systematically adjusting the pool of candidate data used for manipulation, so that the augmentation process receives an appropriate level of randomness. Furthermore, we propose an advanced sentence-embedding methodology to achieve robust pseudo-labeling at the chosen level of manipulation. In other words, we leverage a combined sentence embedding, which incorporates sentence embedding, document embedding, and XAI information from the original data, to assign reliable pseudo-labels. We conducted performance comparisons with existing text augmentation approaches to validate the effectiveness of the proposed methodology. The experimental results demonstrate that the proposed method achieves the highest performance improvement across all the experimental datasets.
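The abstract does not specify how the three sources of information are combined or how a pseudo-label is chosen, so the following is only a minimal sketch under stated assumptions: the combined embedding is formed by concatenation, and an augmented sample receives the label of the nearest class centroid (by cosine similarity) computed from the original data. The function names (`combine`, `class_centroids`, `pseudo_label`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def combine(sentence_vec, document_vec, xai_vec):
    """Concatenate the three views into one vector (assumed combination rule)."""
    return list(sentence_vec) + list(document_vec) + list(xai_vec)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def class_centroids(vectors, labels):
    """Average the combined embeddings of the original samples per class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def pseudo_label(vec, centroids):
    """Assign the label whose centroid is most similar to the augmented sample."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Toy data: 2-d sentence embedding, 2-d document embedding, 1-d XAI score.
originals = [
    combine([1.0, 0.0], [0.9, 0.1], [1.0]),
    combine([0.9, 0.1], [1.0, 0.0], [0.8]),
    combine([0.0, 1.0], [0.1, 0.9], [-1.0]),
    combine([0.1, 0.9], [0.0, 1.0], [-0.8]),
]
labels = ["pos", "pos", "neg", "neg"]
centroids = class_centroids(originals, labels)

# An augmented sample that drifted away from its source sentence.
augmented = combine([0.8, 0.2], [0.7, 0.3], [0.5])
print(pseudo_label(augmented, centroids))
```

In this sketch, a stronger manipulation would move `augmented` further from its source class centroid, which is where a pseudo-label derived from the combined embedding, rather than a label copied from the source sentence, becomes relevant.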
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5745