Keywords: multimodal large language models, humor understanding
Abstract: Humor remains one of the most elusive challenges for artificial intelligence, requiring models to integrate visual perception, cultural knowledge, and creative reasoning. The New Yorker Cartoon Caption Contest (NYCC) offers a uniquely structured testbed for this problem, pairing images with thousands of captions, expert curation, and large-scale audience judgments. Prior work largely reduces humor to black-box classification or preference prediction, overlooking the step-by-step reasoning processes employed by human captionists. We introduce a framework for teaching multimodal large language models (MLLMs) to reason like professional captionists. Central to our approach are captionist reasoning traces that decompose humor into incongruity detection, resolution construction, and punchline evaluation. Models are first adapted through continual pretraining on humor-focused corpora, then trained with supervised fine-tuning on captionist-style traces, and finally aligned with audience humor judgments using reinforcement learning with grounded perceptual and stylistic rewards. Across NYCC-derived matching and ranking tasks, our models significantly outperform strong multimodal baselines. Beyond accuracy, they generate explanations that align with expert strategies and audience preferences. These results highlight humor as a rich frontier for multimodal reasoning and demonstrate that combining explicit reasoning supervision with preference alignment offers a scalable path toward computational humor.
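The abstract does not specify how the grounded perceptual and stylistic reward signals are combined during the reinforcement learning stage. Below is a minimal sketch of one plausible combination, assuming a simple weighted sum; the function names, scoring heuristics, and the weight `alpha` are all illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Caption:
    text: str
    referenced_objects: set[str] = field(default_factory=set)  # entities the caption mentions


def perceptual_reward(caption: Caption, image_objects: set[str]) -> float:
    """Hypothetical grounded perceptual reward: the fraction of entities the
    caption references that are actually visible in the cartoon."""
    if not caption.referenced_objects:
        return 0.0
    grounded = caption.referenced_objects & image_objects
    return len(grounded) / len(caption.referenced_objects)


def stylistic_reward(caption: Caption, max_words: int = 15) -> float:
    """Hypothetical stylistic reward: contest-winning captions tend to be
    terse, so captions over an assumed word budget are penalized."""
    n_words = len(caption.text.split())
    return min(1.0, max_words / max(n_words, 1))


def combined_reward(caption: Caption, image_objects: set[str],
                    alpha: float = 0.7) -> float:
    """Weighted mix of grounding and style; alpha is an assumed weight."""
    return (alpha * perceptual_reward(caption, image_objects)
            + (1 - alpha) * stylistic_reward(caption))


# Example: a short caption whose referenced entities are all in the image.
image = {"dog", "therapist", "couch"}
cap = Caption("He says you never listen.", referenced_objects={"dog", "therapist"})
print(combined_reward(cap, image))  # -> 1.0 with the defaults above
```

In a full pipeline, such a scalar would presumably be fed to a policy-optimization method alongside the audience-preference signal, rather than used on its own.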
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9836