Abstract: Despite significant advancements in image segmentation
and object detection, understanding complex scenes remains a major challenge. Here, we focus on graphical
humor as a paradigmatic example of image interpretation
that requires elucidating the interaction of different scene
elements in the context of prior cognitive knowledge. This
paper introduces HumorDB, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches,
and AI-generated content, including minimally contrastive
pairs where subtle edits differentiate between humorous and
non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on
three tasks: binary humor classification, funniness rating
prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on
the precise regions that make the image funny. Preliminary
mechanistic interpretability studies and evaluation of model
explanations provide initial insights into how different architectures process humor. Our results identify promising
trends and current limitations, suggesting that an effective
understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features
and bridging the gap between visual perception and abstract reasoning.