Bridging the Creativity Understanding Gap: Small-Scale Human Alignment Enables Expert-Level Humor Ranking in LLMs
Abstract: Large Language Models (LLMs) have shown significant limitations in understanding creative content, as demonstrated by the influential work of Hessel et al. (2023) on the New Yorker Cartoon Caption Contest. Their study exposed a substantial gap between LLMs and humans in humor evaluation, establishing that understanding and evaluating creative content is a key challenge in AI development. We revisit this challenge by decomposing humor ranking into three components and systematically improving each: enhancing visual understanding through improved annotation, utilizing LLM-generated humor reasoning and explanations, and implementing targeted alignment with human preference data. Our refined approach achieves 84.7% accuracy in caption ranking, significantly improving upon the previous 67% benchmark and matching the performance of world-renowned human experts in this domain. Notably, while attempts to mimic subgroup preferences through various persona prompts showed minimal impact, finetuning the model on crowd preferences proved remarkably effective. These findings reveal that LLM limitations in creative judgment can be effectively addressed through focused alignment with specific subgroups and individuals. Finally, we argue that truly improving LLMs' creative understanding requires systematic collection of human preference data across creative domains.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Computational Humor, Large Language Models, Preference Learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Position papers
Languages Studied: English
Keywords: Computational Humor, Large Language Models, Preference Learning
Submission Number: 2090