Comparative Analysis of Machine Learning and LLM Approaches for Detecting ChatGPT-Written Essays Under Revision Conditions

Published: 25 Jun 2025, Last Modified: 02 Jul 2025
Venue: IMPS 2024
License: CC BY 4.0
DOI: 10.64028/gnul779674
Keywords: classification, prediction, statistical and machine learning, Generative AI
TL;DR: This paper compares ML and LLM models for detecting ChatGPT-written essays, finding that models like SVM and ELECTRA excel on original AI text but struggle after revisions, indicating challenges to academic integrity in the age of generative AI.
Abstract: ChatGPT, a powerful generative artificial intelligence (AI), can play a significant role in K-12 education by supporting tasks such as answering questions, solving math problems, and generating content like essays, code, and presentation slides. While it is a valuable learning resource, concerns have arisen about its potential misuse by students completing school assignments. Existing commercial detectors, such as Grammarly and GPTZero, are designed to identify AI-generated text in general and lack the specificity required for high-stakes assessments. This study addresses the challenge of detecting the potential use of ChatGPT for academic cheating in high-stakes assessments. Classical machine learning methods, including logistic regression, naïve Bayes, and decision trees, were employed to distinguish essays generated by ChatGPT from those authored by students, and pre-trained language models such as RoBERTa and BERT were compared against these traditional approaches. The analysis focused on prompt 1 from the Kaggle Automated Student Assessment Prize (ASAP) competition. To evaluate the robustness of the detection methods, ChatGPT-generated essays were revised in four ways: with Grammarly Premium, by eighth-grade students, by students in ninth grade or above, and by ChatGPT itself with additional prompting to humanize and naturalize the essays by introducing grammatical mistakes. For detecting unmodified ChatGPT essays, ELECTRA, a pre-trained language model, achieved a quadratic weighted kappa (QWK) of 97%, while a support vector machine (SVM) outperformed the pre-trained language models with a QWK of 99%. The revision methods markedly reduced detection rates across all models. This research addresses concerns about academic integrity in high-stakes assessments involving generative AI technologies.
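The QWK metric reported above is the standard quadratic weighted kappa, $\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}$ with weights $w_{ij} = (i-j)^2/(N-1)^2$, where $O$ and $E$ are the observed and chance-expected agreement matrices. To make the classical-ML baseline concrete, the following is a minimal sketch, assuming a TF-IDF feature representation with scikit-learn's linear SVM and kappa implementations; the toy corpus and preprocessing choices are illustrative placeholders, not the authors' released pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy placeholder corpus: in the study this would be ASAP prompt-1 student
# essays plus ChatGPT-generated counterparts (labels: 0 = student, 1 = ChatGPT).
student_essays = [
    "Computers help me talk to my friends and learn new stuff online.",
    "I think computers are good because you can do homework faster on them.",
    "My family uses the computer to look up places and plan our trips.",
    "Some kids spend too much time on computers instead of playing outside.",
]
chatgpt_essays = [
    "Computers offer numerous benefits, including enhanced communication and access to information.",
    "In conclusion, the advantages of computer use significantly outweigh the potential drawbacks.",
    "Furthermore, computers facilitate learning by providing interactive educational resources.",
    "Overall, responsible computer use can greatly enrich both education and daily life.",
]
texts = student_essays + chatgpt_essays
labels = [0] * len(student_essays) + [1] * len(chatgpt_essays)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Word- and bigram-level TF-IDF features feeding a linear SVM, a common
# strong baseline for distinguishing writing styles.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
clf.fit(X_train, y_train)

# Quadratic weighted kappa, the paper's reported metric; scikit-learn
# computes it directly via cohen_kappa_score with quadratic weights.
qwk = cohen_kappa_score(y_test, clf.predict(X_test), weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

The pre-trained language models (BERT, RoBERTa, ELECTRA) would presumably be fine-tuned as binary sequence classifiers on the same train/test split, for instance with the Hugging Face transformers library, and scored with the same QWK call.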
Supplementary Material: zip
Submission Number: 14