Evaluating LLM Creativity as Long-Tail Performance

Yichen Wang; Mina Lee

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, Venue: MSLD 2026 Poster, Readers: Everyone, License: CC BY 4.0

Keywords: LLM creativity, Evaluation, Taxonomy, Long-tail distribution

Abstract: Creativity evaluations of large language models produce results that are inconsistent across benchmarks and with human judgment. We take a step toward explaining these disagreements through three contributions: a taxonomy of evaluation configurations, a theoretical proposition that LLM creativity is performance under the long tail of a reference distribution, and controlled experiments verifying that existing benchmarks cover distinct subregions of this tail. This explains why inter-benchmark agreement is moderate rather than perfect, and why no single benchmark suffices. We further show that existing creativity-improving methods tend to improve performance within specific evaluation strategies rather than broadly, which cautions against overclaiming general creativity gains.

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 182
