Towards textual data augmentation for neural networks: synonyms and maximum loss

Michal Jungiewicz, Aleksander Smywinski-Pohl

Published: 2019, Last Modified: 16 Apr 2024, Comput. Sci. 2019, Readers: Everyone

Abstract: Data augmentation is one way of dealing with labeled data scarcity and overfitting. Both problems are crucial for modern deep learning algorithms, which require massive amounts of data. Data augmentation is better explored in the context of image analysis than for text. This work is a step towards closing this gap. We propose a method for augmenting textual data when training convolutional neural networks for sentence classification. The augmentation is based on substituting words using a thesaurus as well as the Princeton WordNet. Our method improves upon the baseline in almost all cases. In terms of accuracy, the best variant is 1.2 percentage points better than the baseline.
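As a rough illustration of thesaurus-based word substitution, the following Python sketch replaces words in a tokenized sentence with synonyms. It is not the authors' implementation: the toy thesaurus, the function names `augment` and `pick_max_loss`, the substitution probability `p`, and the stand-in loss function are all assumptions made for this example. In particular, the abstract does not describe how "maximum loss" from the title is used; selecting the augmented candidate with the highest loss is only one possible reading.

```python
import random

# Hypothetical toy thesaurus (assumption). The paper draws synonyms from
# a thesaurus and the Princeton WordNet instead.
TOY_THESAURUS = {
    "good": ["fine", "great"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}


def augment(tokens, thesaurus, rng, p=0.5):
    """Return a copy of tokens where each word that has synonyms is
    replaced by a randomly chosen synonym with probability p."""
    out = []
    for tok in tokens:
        synonyms = thesaurus.get(tok.lower(), [])
        if synonyms and rng.random() < p:
            out.append(rng.choice(synonyms))
        else:
            out.append(tok)
    return out


def pick_max_loss(candidates, loss_fn):
    """Assumed reading of 'maximum loss': keep the candidate on which
    the current model's loss is highest."""
    return max(candidates, key=loss_fn)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "a good movie with a bad ending".split()

# Generate a few augmented variants of the same sentence.
candidates = [augment(sentence, TOY_THESAURUS, rng) for _ in range(4)]


def changed_words(candidate):
    """Stand-in for a model loss (assumption): number of replaced words."""
    return sum(a != b for a, b in zip(candidate, sentence))


hardest = pick_max_loss(candidates, changed_words)
print(" ".join(hardest))
```

In a real training loop, `changed_words` would be replaced by the per-example loss of the network being trained, and the thesaurus lookup by a query against an actual lexical resource.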
