Causal Language Model Perplexity for Human Authorship Attribution

Anonymous

16 Oct 2023 · ACL ARR 2023 October Blind Submission
Abstract: In this paper, we introduce an authorship attribution method that identifies the most likely author of a questioned document based on the perplexity of that document computed under a set of GPT-2 models, each fine-tuned on the writings of one candidate author. We evaluate our method on corpora representing the writings of 50 fiction authors. We find that the perplexity of causal large language models distinguishes among these 50 authors with an overall F-score of 0.99 and a macro-average accuracy of 0.99, considerably outperforming other state-of-the-art methods applied to other datasets with similar numbers of authors. We also test how the performance of our method depends on the length of the questioned document and the amount of training data per author. We find that reaching an F-score of 0.90 with 50 candidate authors requires a minimum of 28,000 tokens of training data per author and a minimum of 70 tokens of test data.
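The core of the method described in the abstract is a perplexity comparison: the questioned document is scored under each author-specific fine-tuned GPT-2 model, and the author whose model assigns the lowest perplexity is predicted. Below is a minimal sketch of that scoring step using the Hugging Face transformers library; the `author_model_dirs` mapping, function names, and the assumption that each candidate author's fine-tuned checkpoint is saved to disk are illustrative, not details taken from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


def perplexity(model, tokenizer, text, device="cpu"):
    """Perplexity of `text` under a causal LM (exp of mean per-token NLL)."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean cross-entropy over tokens; exponentiate for perplexity.
    return torch.exp(out.loss).item()


def attribute(questioned_doc, author_model_dirs, device="cpu"):
    """Return the candidate author whose fine-tuned model gives the lowest perplexity.

    `author_model_dirs` is a hypothetical {author_name: checkpoint_dir} mapping,
    where each directory holds a GPT-2 model fine-tuned on that author's writings.
    """
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    scores = {}
    for author, model_dir in author_model_dirs.items():
        model = GPT2LMHeadModel.from_pretrained(model_dir).to(device).eval()
        scores[author] = perplexity(model, tokenizer, questioned_doc, device)
    predicted = min(scores, key=scores.get)
    return predicted, scores
```

Usage would look like `attribute(doc_text, {"austen": "models/austen", "dickens": "models/dickens"})`, which returns the predicted author along with the per-author perplexity scores.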
Paper Type: short
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Contribution Types: NLP engineering experiment
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.