Keywords: multimodal large language models, humor understanding
Abstract: Humor remains one of the most elusive challenges for artificial intelligence, requiring models to integrate visual perception, cultural knowledge, and creative reasoning. The New Yorker Cartoon Caption Contest (NYCC) offers a uniquely structured testbed for this problem, pairing images with thousands of captions, expert curation, and large-scale audience judgments. Prior work largely reduces humor to black-box classification or preference prediction, overlooking the step-by-step reasoning processes employed by human captionists. We introduce a framework for teaching multimodal large language models (MLLMs) to reason like professional captionists. Central to our approach are captionist reasoning traces that decompose humor into incongruity detection, resolution construction, and punchline evaluation. Models are first adapted through continual pretraining on humor-focused corpora, then trained with supervised fine-tuning on captionist-style traces, and finally aligned with audience humor judgments using reinforcement learning with grounded perceptual and stylistic rewards. Across NYCC-derived matching and ranking tasks, our models significantly outperform strong multimodal baselines. Beyond accuracy, they generate explanations that align with expert strategies and audience preferences. These results highlight humor as a rich frontier for multimodal reasoning and demonstrate that combining explicit reasoning supervision with preference alignment offers a scalable path toward computational humor.
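The abstract does not specify how the grounded perceptual and stylistic reward signals are combined during the reinforcement learning stage. Below is a minimal sketch of one plausible combination, assuming a simple weighted sum; the function names, scoring heuristics, and the weight `alpha` are all illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Caption:
    text: str
    referenced_objects: set[str] = field(default_factory=set)  # entities the caption mentions


def perceptual_reward(caption: Caption, image_objects: set[str]) -> float:
    """Hypothetical grounded perceptual reward: the fraction of entities the
    caption references that are actually visible in the cartoon."""
    if not caption.referenced_objects:
        return 0.0
    grounded = caption.referenced_objects & image_objects
    return len(grounded) / len(caption.referenced_objects)


def stylistic_reward(caption: Caption, max_words: int = 15) -> float:
    """Hypothetical stylistic reward: contest-winning captions tend to be
    terse, so captions over an assumed word budget are penalized."""
    n_words = len(caption.text.split())
    return min(1.0, max_words / max(n_words, 1))


def combined_reward(caption: Caption, image_objects: set[str],
                    alpha: float = 0.7) -> float:
    """Weighted mix of grounding and style; alpha is an assumed weight."""
    return (alpha * perceptual_reward(caption, image_objects)
            + (1 - alpha) * stylistic_reward(caption))


# Example: a short caption whose referenced entities are all in the image.
image = {"dog", "therapist", "couch"}
cap = Caption("He says you never listen.", referenced_objects={"dog", "therapist"})
print(combined_reward(cap, image))  # -> 1.0 with the defaults above
```

In a full pipeline, such a scalar would presumably be fed to a policy-optimization method alongside the audience-preference signal, rather than used on its own.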
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9836