AI is Misled by GenAI: Stylistic Bias in Automated Assessment of Creativity in Large Language Models
Track: Paper
Keywords: creativity, automated assessment, large language models, bias
TL;DR: Because LLM responses have a distinctive style, automated creativity metrics rate them as both more original and less diverse than they actually are.
Abstract: Outputs from large language models (LLMs) are often rated as highly original yet show low variability (i.e., greater homogeneity) compared to human responses, a pattern we refer to as the *LLM creativity paradox*. However, prior work suggests that assessments of originality and variability may reflect stylistic features of LLM outputs rather than underlying conceptual novelty. The present study investigated this issue using outputs from seven distinct LLMs on a modified Alternative Uses Task. We scored verbatim and "humanized" LLM responses (reworded to reduce verbosity while preserving core ideas) using four automated metrics (the supervised OCSAI and CLAUS models and two unsupervised semantic-distance tools) and compared them with responses from 30 human participants. As expected, verbatim LLM responses were rated as substantially more original than human responses (median $d = 1.46$) but showed markedly lower variability (median $d = 0.85$). Humanizing the responses strongly decreased originality and weakly increased variability, indicating that part of the LLM creativity paradox is driven by stylistic cues. Nevertheless, even after humanization, originality scores of LLM responses remained higher (median $d = 0.80$) and their variability lower ($d = 0.57$) than those of human responses. These findings suggest that automated assessment tools can be partially misled by the style of LLM outputs, highlighting the need for caution when using automated methods to evaluate machine-generated ideas, particularly in real-world applications such as providing feedback or guiding creative workflows.
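As an illustration of the effect-size comparisons reported in the abstract, here is a minimal sketch (not taken from the paper) of computing Cohen's $d$ between originality scores for LLM and human responses; the score arrays and the 1-5 rating scale are hypothetical.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d using a pooled standard deviation (illustrative only)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(
        ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    )
    return (x.mean() - y.mean()) / pooled_sd

# Hypothetical originality ratings from an automated scorer (1-5 scale)
llm_scores = [4.1, 3.9, 4.4, 4.2, 3.8]
human_scores = [3.2, 3.5, 2.9, 3.6, 3.1]
print(f"Cohen's d (LLM vs. human): {cohens_d(llm_scores, human_scores):.2f}")
```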
Submission Number: 24