PrismAI: An Environment for AI-generated Text Recognition

ACL ARR 2025 February Submission7538 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: We introduce **PrismAI**, an environment for the automatic detection of AI-generated text. Our contributions are threefold. First, we release the largest AI-detection dataset to date, comprising 537,588 human-written and AI-generated documents in English and German across seven domains: scientific writing, weblogs, parliamentary speeches, legal court cases, classic literature, news articles, and student essays. The AI-generated documents are synthesized using state-of-the-art models. Second, we introduce **Luminar**, a CNN-based model for the automatic detection of AI-generated text. Our experiments show that, by leveraging the hidden states of an LLM to derive intermediate likelihoods, our model can significantly outperform other likelihood-backed baselines despite its small footprint, while demonstrating strong generalization in out-of-domain and out-of-language scenarios. Third, we unify existing datasets into a common corpus called **AIGT-World** and make it accessible through a publicly available web-based corpus explorer, which supports searching, reading, visualizing, and interacting with the underlying data. With these contributions, we aim to advance research in this area, extend the field to non-English texts, propose new models, and unify existing efforts toward a common dataset and objective.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: generative models, data augmentation, multilingual evaluation, corpus creation, language resources, multilingual corpora, generalization
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, German
Submission Number: 7538