Death of the Novel(ty): Beyond N-Gram Novelty as a Metric for Textual Creativity

ICLR 2026 Conference Submission 19846 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: creativity, creative writing, evaluation, creativity evaluation, machine creativity, n-gram novelty
TL;DR: Study with expert writers cautions against using n-gram novelty for creativity evaluation. Open-source LLMs tend to sound less pragmatic as n-gram novelty increases. Evaluation of close reading skills of frontier and fine-tuned LLMs.
Abstract: $N$-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and $n$-gram novelty through 7542 annotations of novelty, pragmaticality, and sensicality collected from expert writers ($n=26$) via \emph{close reading} of human and AI-generated text. We find that while $n$-gram novelty is positively associated with expert-judged creativity, $\approx 91\%$ of expressions in the top quartile of $n$-gram novelty are not judged as creative, cautioning against relying on $n$-gram novelty alone. Furthermore, unlike in human-written text, higher $n$-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally find that they are less likely than humans to produce creative expressions. Using our dataset, we test whether zero-shot, few-shot, and fine-tuned models can identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs perform well above random but leave room for improvement, struggling in particular to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
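For readers unfamiliar with the metric under discussion, below is a minimal sketch of $n$-gram novelty as it is commonly defined: the fraction of a text's $n$-grams that do not appear in a reference (training) corpus. This is an illustrative assumption about the standard formulation, not the authors' exact implementation; the function names, tokenization, and toy data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return the consecutive n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(text: Sequence[str],
                  reference_ngrams: Set[Tuple[str, ...]],
                  n: int = 3) -> float:
    """Fraction of the text's n-grams absent from the reference set
    (higher = more 'novel' relative to the reference corpus)."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return sum(g not in reference_ngrams for g in grams) / len(grams)


# Toy example with a tiny "training corpus"; real evaluations index
# the model's actual training data rather than a single sentence.
corpus = "the cat sat on the mat while the dog slept".split()
reference = set(ngrams(corpus, 3))
generation = "the cat sat on a velvet chessboard of moonlight".split()
print(f"3-gram novelty: {ngram_novelty(generation, reference):.2f}")
```

The paper's central caution is that a high score from a metric like this captures only the novelty half of creativity; it says nothing about whether the expression remains sensical and pragmatic.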
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19846