BERT Goes Off-Topic: Investigating the Domain Transfer Challenge in Genre Classification

Anonymous

17 Apr 2023 · ACL ARR 2023 April Blind Submission
Abstract: While the performance of many text classification tasks has recently improved thanks to pre-trained language models (e.g., BERT), in this paper we show that such models still suffer from a performance gap in genre classification when the distribution of topics changes. For example, a genre classifier trained on political topics often fails when tested on documents about sport or medicine. In this work, 1) we develop methods to quantify this phenomenon empirically, 2) we verify that domain transfer in genre classification remains challenging even for pre-trained models, and 3) we develop a data augmentation approach that generates texts in any desired genre and on any desired topic, even when the training corpus contains no documents that are both in that particular genre and on that particular topic. We empirically verify that augmenting the training dataset with these synthetic documents facilitates domain transfer, so that the model can correctly predict genres that lack ``on-topic'' examples in the training set. For some topics, the F1 classification score improves from .39 to .65.
Paper Type: long
Research Area: Resources and Evaluation
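The abstract does not specify how the genre- and topic-conditioned texts are produced, so the following is only a minimal, hypothetical sketch of the general idea: prompting an off-the-shelf generative language model for every (genre, topic) pair, including pairs absent from the original training corpus, and labeling the outputs by genre so they can be mixed into the classifier's fine-tuning data. The genre and topic lists, the GPT-2 generator, and the prompt wording are all placeholder assumptions, not the paper's actual setup.

```python
from itertools import product
from transformers import pipeline

# Hypothetical label inventories; the paper's real genre/topic sets are not
# given in the abstract.
GENRES = ["news report", "personal blog post", "legal document"]
TOPICS = ["sport", "medicine", "politics"]

# Placeholder generator; any conditional text-generation model could be used.
generator = pipeline("text-generation", model="gpt2")


def generate_synthetic_corpus(n_per_pair: int = 5, max_new_tokens: int = 120):
    """Generate documents for every (genre, topic) combination, including
    combinations that never co-occur in the original training data."""
    corpus = []
    for genre, topic in product(GENRES, TOPICS):
        prompt = f"Write a {genre} about {topic}:\n"
        outputs = generator(
            prompt,
            max_new_tokens=max_new_tokens,
            num_return_sequences=n_per_pair,
            do_sample=True,
            top_p=0.95,
        )
        for out in outputs:
            text = out["generated_text"][len(prompt):].strip()
            # Label with the genre only, so the classifier learns genre cues
            # that are decoupled from topic.
            corpus.append({"text": text, "label": genre})
    return corpus


if __name__ == "__main__":
    synthetic = generate_synthetic_corpus(n_per_pair=2)
    print(f"Generated {len(synthetic)} synthetic documents")
    # These examples would then be added to the (topic-skewed) training set
    # before fine-tuning the BERT genre classifier.
```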