Evaluating LLM Creativity as Long-Tail Performance

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | CC BY 4.0
Keywords: LLM creativity, Evaluation, Taxonomy, Long-tail distribution
Abstract: Creativity evaluations of large language models yield results that are inconsistent across benchmarks and with human judgment. We take a step toward explaining these disagreements through three contributions: a taxonomy of evaluation configurations, a theoretical proposition that LLM creativity is performance in the long tail of a reference distribution, and controlled experiments verifying that existing benchmarks cover distinct subregions of this tail. This explains why inter-benchmark agreement is only moderate, and why no single benchmark suffices. We further show that existing creativity-improving methods tend to improve performance within specific evaluation strategies rather than broadly, which cautions against overclaiming general creativity gains.
Submission Number: 182