Keywords: AI creativity, evaluation methodology, human-grounded evaluation
Abstract: We introduce creativity coverage, a novel framework for evaluating large language model (LLM) creativity as a boundary rather than a scalar. Unlike existing methods that measure proximity to human creative standards, our approach identifies hard limits: which regions of human creative space can LLMs reach, and which remain beyond their grasp? This formulation aligns with theories of transformational creativity, which emphasize moving beyond known conceptual boundaries rather than performing well within them. We define human creativity boundaries using the distribution of human responses in a shared semantic embedding space, then measure LLM coverage over this space. Across divergent thinking, convergent reasoning, and creative writing tasks, we find that creative boundaries are strongly task-dependent: models achieve high coverage on structured tasks but occupy only a narrow subset of human space in open-ended writing. Our metric correlates with established diversity measures yet captures complementary information. We further identify specific linguistic features—narrative length, lexical specificity, novel entities—that characterize human creativity beyond model reach, offering actionable insights for improving LLM creative capabilities.
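To make the abstract's core idea concrete, here is a minimal sketch of one plausible reading of the coverage computation: embed human and model responses in a shared semantic space, treat the human response distribution as the reference region, and measure what fraction of it the model's outputs reach. The embedding source, the density-adaptive radius rule, and the nearest-neighbor construction are all our assumptions for illustration, not the paper's actual specification.

```python
# Hypothetical sketch of a "coverage over human creative space" metric.
# Assumes responses have already been embedded (e.g., with any sentence
# encoder); random vectors stand in for real embeddings below.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def coverage(human_emb: np.ndarray, model_emb: np.ndarray, k: int = 5) -> float:
    """Fraction of human responses 'covered' by at least one model response.

    A human point counts as covered if some model embedding lies within that
    point's local scale, estimated as the distance to its k-th nearest human
    neighbor (a density-adaptive radius -- an illustrative assumption).
    """
    # Local radius for each human point: distance to its k-th human neighbor
    # (column 0 of the result is the zero self-distance, so query k+1).
    nn_human = NearestNeighbors(n_neighbors=k + 1).fit(human_emb)
    dists, _ = nn_human.kneighbors(human_emb)
    radii = dists[:, -1]

    # Distance from each human point to its nearest model response.
    nn_model = NearestNeighbors(n_neighbors=1).fit(model_emb)
    d_to_model, _ = nn_model.kneighbors(human_emb)

    return float(np.mean(d_to_model[:, 0] <= radii))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for embeddings of 200 human and 150 LLM responses.
    human = rng.normal(size=(200, 384))
    model = rng.normal(size=(150, 384))
    print(f"coverage = {coverage(human, model):.3f}")
```

Under this reading, a model that clusters in one dense region of human space scores low even if its outputs are individually human-like, which is what distinguishes coverage from proximity-based diversity measures.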
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: analysis, automatic evaluation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6956