Abstract: The use of multimodal memes to spread hatred, propaganda, and violence across social and digital media can be tackled through content moderation driven by AI-based meme analysis. In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks and introduce a novel approach, Combining VLM Explanation to Fine-tune LLMs (CoVExFiL). We generate VLM-based interpretations of meme images and fine-tune LLMs on this textual understanding together with the embedded meme text. Our contributions are threefold: (1) benchmarking VLMs with diverse prompting strategies on these sub-tasks; (2) evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) proposing a novel approach in which detailed meme interpretations generated by VLMs are used to train smaller LLMs, significantly improving classification. Combining VLMs with LLMs improves the baseline performance by 8.34\%, 3.52\%, and 26.24\% for sarcasm, offensiveness, and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and offer a novel strategy for meme understanding.
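The abstract describes a two-stage pipeline (VLM-generated meme interpretation, then LoRA fine-tuning of a smaller LLM on that interpretation plus the embedded text) but gives no implementation details. The sketch below is an illustration only, assuming Hugging Face checkpoints (llava-hf/llava-1.5-7b-hf as the VLM, a small Llama backbone as the classifier) and PEFT LoRA; all model names, prompts, and hyperparameters are assumptions, not the paper's reported configuration.

# Minimal sketch of the two-stage idea summarized in the abstract:
# (1) a VLM produces a textual interpretation of a meme image,
# (2) a smaller LLM is LoRA-fine-tuned on that interpretation plus the
#     embedded meme text for classification (e.g., offensive vs. not).
# All checkpoints, prompts, and hyperparameters are illustrative assumptions.
import torch
from PIL import Image
from transformers import (AutoProcessor, LlavaForConditionalGeneration,
                          AutoTokenizer, AutoModelForSequenceClassification)
from peft import LoraConfig, get_peft_model

# --- Stage 1: VLM-generated meme interpretation (assumed LLaVA checkpoint) ---
vlm_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(vlm_id)
vlm = LlavaForConditionalGeneration.from_pretrained(
    vlm_id, torch_dtype=torch.float16, device_map="auto")

def explain_meme(image_path: str, embedded_text: str) -> str:
    """Ask the VLM to describe the meme image and its intended message."""
    prompt = (f"USER: <image>\nThe meme contains the text: '{embedded_text}'. "
              "Explain the meme's visual content and intended message.\nASSISTANT:")
    inputs = processor(images=Image.open(image_path), text=prompt,
                       return_tensors="pt").to(vlm.device, torch.float16)
    out = vlm.generate(**inputs, max_new_tokens=128)
    return processor.decode(out[0], skip_special_tokens=True)

# --- Stage 2: LoRA fine-tuning of a smaller LLM on the interpretations ---
llm_id = "meta-llama/Llama-3.2-1B"   # assumed small backbone
tokenizer = AutoTokenizer.from_pretrained(llm_id)
tokenizer.pad_token = tokenizer.eos_token          # Llama has no pad token
classifier = AutoModelForSequenceClassification.from_pretrained(
    llm_id, num_labels=2)                          # binary task as an example
classifier.config.pad_token_id = tokenizer.pad_token_id
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS")
classifier = get_peft_model(classifier, lora_cfg)

# Each training example pairs explain_meme(image, text) + the embedded text
# with a label; the LoRA-adapted classifier is then trained with a standard
# Hugging Face Trainer loop over the tokenized interpretations.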
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal, vision-language models, meme classification
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 538