Abstract: As the capabilities of artificial intelligence (AI) grow, language models can now produce poetry that closely resembles human writing. However, it remains unclear whether people can reliably detect which works were written by humans and which were generated by machines. We conduct a Turing-inspired experiment comparing human and AI detection capabilities using a dataset of 300 incomplete poems completed by GPT-4o, Gemini 1.5, and Llama 3.2. Five human evaluators achieved a mean accuracy of 95.8% in distinguishing human from AI continuations, while cross-model evaluations peaked at 55% accuracy. These findings highlight that, for now, human expertise remains important for both creating and distinguishing poetry.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Poetry, Machine Learning, NLP, Detection, Turing
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 3956