Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

Published: 25 Sept 2024, Last Modified: 06 Nov 2024
Venue: NeurIPS 2024 oral
License: CC BY-NC-ND 4.0
Keywords: comic narrative understanding, visual reasoning, multimodal benchmark, humor understanding
Abstract: Recent advancements in large vision language models have demonstrated remarkable proficiency across a wide range of tasks. Yet these models still struggle to understand the nuances of human humor conveyed through juxtaposition, particularly when it involves the nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial and open-source large vision language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still struggle with this task. Our findings offer insights into the current limitations of, and potential improvements for, AI in understanding human creative expression.
Primary Area: Natural language processing
Flagged For Ethics Review: true
Submission Number: 18748