Large Language Models (LLMs) have achieved remarkable progress in natural language comprehension, reasoning, and generation, sparking interest in their creative potential. Automating creativity evaluation in LLMs, particularly on physical reasoning tasks, presents a transformative opportunity to accelerate scientific discovery by enabling innovative solutions, uncovering patterns, and automating problem-solving processes. Current creativity evaluation frameworks, however, rely heavily on human annotation, making them subjective, resource-intensive, and impractical to scale. To address this, we introduce a novel automated evaluation framework rooted in the cognitive science principles of divergent and convergent thinking. Divergent creativity is measured using Semantic Entropy, a sampling-based metric that quantifies variability in generated outputs to capture the novelty of ideas. Convergent creativity is assessed using a modified retrieval-based discussion framework that is 60% more efficient, in which autonomous multi-agent systems evaluate task solutions for feasibility, safety, and effectiveness. We implement these methodologies within a benchmark based on the MacGyver dataset, which contains 300 real-world, solvable problems requiring innovative use of everyday objects. Our framework evaluates state-of-the-art LLMs, such as GPT and LLaMA models, while analyzing the effects of key parameters such as temperature, model size, and recency. By automating creativity evaluation, we establish a scalable, objective, and reproducible methodology to enhance LLM development, paving the way for breakthroughs in scientific discovery and creative problem-solving across diverse fields.
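
As a rough illustration of the divergent-thinking metric, the sketch below estimates semantic entropy by grouping sampled responses into semantic-equivalence clusters and computing the Shannon entropy of the cluster distribution. The `are_equivalent` helper and the toy exact-match equivalence check are hypothetical placeholders, not the paper's implementation; in practice a bidirectional-entailment (NLI) model is a common choice for the equivalence test.

```python
import math
from typing import Callable, List

def semantic_entropy(
    responses: List[str],
    are_equivalent: Callable[[str, str], bool],
) -> float:
    """Estimate semantic entropy of sampled LLM responses.

    Responses are greedily clustered into semantic-equivalence groups,
    and Shannon entropy is computed over the cluster frequencies.
    Higher entropy indicates more semantically diverse (divergent) outputs.
    """
    clusters: List[List[str]] = []
    for r in responses:
        for cluster in clusters:
            # Hypothetical equivalence check (e.g. bidirectional entailment).
            if are_equivalent(cluster[0], r):
                cluster.append(r)
                break
        else:
            clusters.append([r])

    n = len(responses)
    entropy = 0.0
    for cluster in clusters:
        p = len(cluster) / n
        entropy -= p * math.log(p)
    return entropy

if __name__ == "__main__":
    # Toy equivalence: exact match after normalization; an entailment model
    # would replace this when judging real generations.
    samples = [
        "Use the belt as a makeshift strap.",
        "use the belt as a makeshift strap",
        "Tie the shoelaces together to form a rope.",
    ]
    eq = lambda a, b: a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")
    print(f"Semantic entropy: {semantic_entropy(samples, eq):.3f}")
```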