Synthetic text augmentation for non-topical classification: a case of document genre

Anonymous

16 Dec 2022 (modified: 05 May 2023) · ACL ARR 2022 December Blind Submission
Abstract: While non-topical text classification (e.g., document genre, author profile, sentiment) has recently improved thanks to pre-trained language models (e.g., BERT), the resulting classifiers suffer a performance gap when applied to new domains. For example, a genre classifier trained on political topics often fails when tested on documents about sport or medicine. In this work, 1) we develop a robust method to quantify this phenomenon empirically; 2) we verify that domain transfer in non-topical classification remains challenging even for modern pre-trained models; and 3) we test a data augmentation approach that involves training text generators in any desired genre and on any topic, even when the training corpus contains no documents that are both in that genre and on that topic. We empirically verify that augmenting the training dataset with the synthetic documents generated by our approach facilitates domain transfer, so that the model can correctly predict genres that have no "on-topic" examples in the training set. For some topics, the "off-topic" F1 score improves from 57.6 to 73.0.
Paper Type: long
Research Area: Resources and Evaluation