Subword Information for Authorship Attribution: A Deep Learning Approach

Anonymous

16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: Authorship attribution is the process of unveiling the hidden identity of authors from a corpus of literary data. Many previous works on authorship attribution employed word-based models to capture an author's distinctive writing style. The usable vocabulary of the training corpus depends heavily on the pre-trained word vectors, which limits the performance of these models. Alternative character-based models, proposed to overcome the rare-word problem arising from different linguistic features, fail to capture the sequential relationships between words that are inherent in text. The question we address in this paper is whether introducing Gaussian noise can tackle the ambiguity of hidden writing style (or words) while preserving the sequential context of the text, and thereby improve authorship-related tasks. We propose two bidirectional long short-term memory (BLSTM) models with a two-dimensional convolutional neural network (2D CNN) and a two-dimensional pooling operation to capture the sequential writing styles that distinguish different authors. To determine an appropriate writing-style representation, we use the BLSTM to capture the sequential relationships between characteristics from subword information, and the 2D CNN to capture the local syntactic position of the style in unlabelled input text. We extensively evaluate the model with subword embeddings and compare it against state-of-the-art methods across a wide range of authors. Our methods yield improvements of 2.42%, 0.96% and 0.97% on CCAT50, Blog50 and Twitter, respectively, and produce comparable results on the remaining dataset.
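
The abstract does not say how subword units are extracted. As an illustration only, the sketch below assumes character n-grams with word-boundary markers, a common scheme for subword embeddings, and shows how a misspelled or rare word still shares units with a known word while the word order of the text is preserved. The function names `char_ngrams` and `subword_sequence` and the n-gram range of 3 to 4 are hypothetical choices, not taken from the paper.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Return the character n-grams of a word wrapped in boundary markers.

    Assumption: n-grams of length n_min..n_max over "<word>"; the paper's
    actual subword scheme is not given in the abstract.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def subword_sequence(text, n_min=3, n_max=4):
    """Map each whitespace token to its n-grams, keeping the word order."""
    return [(tok, char_ngrams(tok, n_min, n_max)) for tok in text.lower().split()]


# A misspelling shares many subword units with the correctly spelled word,
# so it need not fall back to a single out-of-vocabulary vector.
shared = set(char_ngrams("writing")) & set(char_ngrams("writting"))
```

In a model of the kind the abstract describes, each token's n-grams would typically be embedded and combined into one vector per word before the sequence is passed to a recurrent layer; that step is omitted here.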