- Abstract: We address the problem of open-set authorship verification, a classification task in which texts of unknown authorship must be attributed to a given author even though the unknown documents in the test set are excluded from the training set. We present an end-to-end model-building process that is universally applicable to a wide variety of corpora with little to no modification or fine-tuning. It relies on transfer learning of a deep language model and uses a generative adversarial network and a number of text augmentation techniques to improve the model's generalization ability. The language model encodes documents of known and unknown authorship into a domain-invariant space, aligning document pairs as input to the classifier while keeping them separate. The resulting embeddings are used to train an ensemble of recurrent and quasi-recurrent neural networks. The entire pipeline is bidirectional; forward and backward pass results are averaged. We perform experiments on four traditional authorship verification datasets, a collection of machine learning papers mined from the web, and a large Amazon Reviews dataset. Experimental results surpass both baseline and current state-of-the-art techniques, validating the proposed approach.
- Keywords: authorship verification, transfer learning, language modeling
- TL;DR: We propose an end-to-end model-building process that is universally applicable to a wide variety of authorship verification corpora and outperforms the state of the art with little to no modification or fine-tuning.