Abstract: In this paper, we present an attention-based data augmentation (ADA) approach to address the poor performance of classification algorithms on imbalanced text datasets. The proposed approach first ranks the vocabulary of the minority class by its similarity to the minority-class dataset, using a vector similarity measure. It then generates a dataset from the minority-class instances based on the high-ranking and low-ranking words. Next, it employs an attention mechanism to extract important contextual words from the documents corresponding to the high-ranking words in the generated dataset. Finally, it enhances the minority-class dataset(s) by substituting the identified important contextual words with their most appropriate contextual and semantic equivalents using masked language modeling (MLM). ADA oversamples and balances the training dataset by augmenting the minority class with the newly generated documents. We investigate the significance of the oversampled and balanced datasets generated by ADA for document classification using BiLSTM on both binary-class and multiclass versions of six publicly available Amazon reviews datasets. We compare ADA against four well-known text data augmentation techniques: one based on random operations, one based on label knowledge, and two MLM-based approaches that employ random masking to generate new documents for augmentation. We also compare ADA with an ablation-like baseline. The experimental results reveal that the classification performance of BiLSTM on the original imbalanced datasets is poor compared to its performance on datasets augmented with any of the data augmentation techniques. Furthermore, BiLSTM performs best when trained on datasets balanced using ADA rather than on datasets balanced by the comparison approaches. This highlights the effectiveness of identifying and masking important contextual words in minority-class instances with MLM, as opposed to random operations, basic label knowledge, or MLM-based approaches that employ random masking.
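To make the pipeline described above concrete, the following Python sketch mimics its outer steps under strong simplifying assumptions; it is not the paper's implementation. Toy random vectors stand in for pretrained word embeddings, cosine similarity to the class centroid stands in for both the paper's vector similarity measure and its attention-based scoring of contextual words, and Hugging Face's fill-mask pipeline with bert-base-uncased plays the role of the MLM. Names such as `word_vecs`, `high_ranking`, and `minority_docs` are illustrative, not from the paper.

```python
# Illustrative sketch of an ADA-style minority-class augmentation loop.
# Assumptions: toy embeddings replace pretrained vectors, and centroid
# similarity replaces the paper's attention mechanism.
import numpy as np
from transformers import pipeline

minority_docs = [
    "battery died after one week of use",
    "charger stopped working very quickly",
]

# Placeholder embeddings; a real system would load pretrained word vectors.
rng = np.random.default_rng(0)
vocab = sorted({w for doc in minority_docs for w in doc.split()})
word_vecs = {w: rng.normal(size=50) for w in vocab}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 1: rank the minority-class vocabulary by similarity to the class
# centroid (stand-in for the paper's vector similarity measure).
centroid = np.mean([word_vecs[w] for w in vocab], axis=0)
ranked = sorted(vocab, key=lambda w: cosine(word_vecs[w], centroid), reverse=True)
high_ranking = set(ranked[: len(ranked) // 2])

# Steps 2-4: pick the most class-similar word in a document (stand-in for
# attention-based selection), mask it, and let an MLM propose contextual
# and semantic substitutes that form new minority-class documents.
fill = pipeline("fill-mask", model="bert-base-uncased")
doc = minority_docs[0]
target = max(doc.split(), key=lambda w: cosine(word_vecs[w], centroid))
masked = doc.replace(target, fill.tokenizer.mask_token, 1)
for candidate in fill(masked, top_k=3):
    print(candidate["sequence"])  # candidate augmented documents
```

In a full pipeline, the generated documents would keep the minority-class label and be appended to the training set until the class distribution is balanced, after which the classifier (BiLSTM in the paper) is trained on the augmented data.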