Keywords: Vision-language models, Meme generation, Hierarchical chain-of-thought (CoT) supervision, Pairwise reward modeling, Reinforcement learning
TL;DR: HUMOR trains meme generators by multi-path CoT + rank-consistent pairwise rewards + group-wise RL, yielding human-aligned humor with theory-backed, bounded-degradation guarantees across vision–language base models.
Abstract: Generating humorous memes is a challenging multimodal task that goes beyond direct image-to-caption supervision: it requires nuanced reasoning over visual content, contextual cues, and subjective humor. To bridge this gap between visual perception and humorous punchline creation, we propose HUMOR, a novel framework that guides vision-language models (VLMs) through hierarchical reasoning and aligns them with group-wise, human-like preferences.
First, HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT): the model begins by identifying a template-level intent, then explores diverse reasoning paths under different contexts, and finally anchors to a high-quality, context-specific path.
This CoT supervision, derived by tracing back from ground-truth captions, enhances reasoning diversity.
We further show analytically that this multi-path exploration with anchoring maintains high expected humor quality, under the practical condition that high-quality paths retain significant probability mass.
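As a toy illustration of this condition (our own sketch, not code from the paper): if a high-quality reasoning path retains probability mass p under the sampling policy, then K independent CoT samples contain at least one such path with probability 1 - (1 - p)^K, so anchoring to the best sampled path succeeds with probability approaching 1 as K grows.

```python
def prob_anchor_hits_good_path(p: float, k: int) -> float:
    """P(at least one of k independent CoT samples is high-quality)."""
    return 1.0 - (1.0 - p) ** k

# Even with modest mass on good paths, a few sampled paths suffice.
for p in (0.1, 0.3):
    print(p, [round(prob_anchor_hits_good_path(p, k), 3) for k in (1, 2, 4, 8)])
```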
Second, to capture subjective humor, we train a pairwise reward model that operates within groups of memes sharing the same template.
Following established theory, this approach ensures a consistent and robust proxy for human preference, even with noisy labels.
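A minimal sketch of such a within-group pairwise objective is given below. The Bradley-Terry-style loss is a standard choice for pairwise preference learning; its exact use here, and all variable names, are our assumptions, since the abstract does not specify the loss.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_preferred: torch.Tensor,
                         r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss, -log sigma(r_w - r_l), over within-group pairs."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy usage: reward scores for three meme pairs sharing one template.
r_w = torch.tensor([1.2, 0.4, 0.9])   # human-preferred memes
r_l = torch.tensor([0.3, 0.5, -0.1])  # rejected memes
print(pairwise_reward_loss(r_w, r_l))
```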
The reward model then drives group-wise reinforcement learning, with a guarantee that the model's humor quality degrades by at most a bounded amount.
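The sketch below shows one common form such a group-wise objective can take: advantages computed relative to a same-template group of samples, plus a KL penalty to a frozen reference policy, a standard mechanism behind bounded-degradation guarantees. The specific objective and hyperparameters are our assumptions, not the paper's.

```python
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Center and scale rewards within one same-template group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def groupwise_rl_loss(logp_new: torch.Tensor,
                      logp_ref: torch.Tensor,
                      rewards: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """REINFORCE-style surrogate with group-relative advantages + KL term."""
    adv = group_advantages(rewards).detach()
    kl = (logp_new - logp_ref).mean()   # penalize drift from the reference
    return -(logp_new * adv).mean() + beta * kl

# Toy usage: log-probs and rewards for 4 sampled captions of one template.
logp_new = torch.tensor([-1.0, -2.0, -0.5, -1.5], requires_grad=True)
logp_ref = torch.tensor([-1.1, -1.9, -0.6, -1.4])
rewards = torch.tensor([0.8, 0.2, 0.9, 0.1])
groupwise_rl_loss(logp_new, logp_ref, rewards).backward()
```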
Experiments show that HUMOR equips various base VLMs with greater reasoning diversity, more reliable preference alignment, and higher overall meme quality than strong baselines.
Beyond memes, our work presents a general training paradigm for open-ended, human-aligned multimodal generation, where success is guided by comparative judgment within coherent output groups.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20036