Meme Generation with Multi-modal Input and Planning

Published: 01 Jan 2024 · Last Modified: 11 Oct 2025 · MMGR@MM 2024 · CC BY-SA 4.0
Abstract: Memes are a popular multi-modal artefact used across social media platforms to convey emotions and ideas such as humor, distress, and commentary. With the advent of generative AI, interest in synthesizing memes from user-provided inputs has grown. Prior work has focused on generating memes from either a text prompt or a template image as the query. Restricting the input to a single modality makes it hard for users to specify their intent clearly. In this work, we explore a novel multi-modal input specification in which a user provides both a text prompt and a widely popular meme template image. We decompose meme generation into two sub-tasks: (i) meme image template retrieval and (ii) meme text caption generation. We present a novel template and caption planning strategy to effectively represent the multi-modal user input for both sub-tasks. We demonstrate the effectiveness of the proposed system through experiments and a user study: users found the memes generated by our system easy to understand (~77%), funny/humorous (~70%), non-offensive (~99%), and relevant to their query (3.38/5).
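The two-stage decomposition described in the abstract can be sketched as follows. This is an illustrative toy, not the authors' implementation: the template library, the bag-of-words "embedding" (standing in for a real multi-modal encoder), and the placeholder caption generator are all assumptions for demonstration.

```python
from collections import Counter
from math import sqrt

# Hypothetical template library: template name -> textual description.
TEMPLATE_LIBRARY = {
    "distracted-boyfriend": "man looks at another woman while girlfriend disapproves",
    "drake": "drake rejects one option and approves another option",
    "success-kid": "baby clenches fist in triumph on a beach",
}

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding standing in for a multi-modal encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_template(prompt: str, reference_image_desc: str) -> str:
    """Sub-task (i): pick the template closest to the fused text+image query."""
    query = embed(prompt + " " + reference_image_desc)
    return max(TEMPLATE_LIBRARY,
               key=lambda name: cosine(query, embed(TEMPLATE_LIBRARY[name])))

def generate_caption(prompt: str, template: str) -> str:
    """Sub-task (ii): placeholder; a real system would condition an LLM
    on the planned template and the user's prompt."""
    return f"[{template}] {prompt}"

prompt = "when my code finally compiles"
template = retrieve_template(prompt, "baby clenches fist in triumph")
print(generate_caption(prompt, template))
```

In the paper's actual system the fused query would come from a learned planning strategy over both modalities; the point here is only the retrieval-then-captioning control flow.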