Knowledge-Enhanced Multi-perspective Incongruity Perception Network for Multimodal Sarcasm Detection

Published: 2024, Last Modified: 26 May 2026, ICME 2024, CC BY-SA 4.0
Abstract: Recent years have seen growing demand for multimodal sarcasm detection on social media platforms. Although substantial effort has produced significant progress, prior work often fails to fully integrate commonsense knowledge and struggles to model the incongruity among the implicit meanings of multimodal cues. To address these limitations, we propose a novel Knowledge-Enhanced Multi-perspective Incongruity Perception network, named KEMIP. Specifically, we use generative language models to produce captions for images and commonsense knowledge for text, giving the model a more comprehensive understanding of each input. To exploit the essential cues from multiple perspectives, we then design an incongruity perception module that uses several fusion networks to capture both literal and implicit inconsistencies. Finally, an ensemble weighting gate integrates the results of the individual perspectives. Experiments on a public multimodal sarcasm detection benchmark demonstrate the superiority of the proposed KEMIP framework.
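The abstract does not say how the ensemble weighting gate is computed, so the following is only an illustrative sketch, not the authors' implementation. It assumes each perspective (for example, a literal and an implicit fusion network) outputs class logits, and that the gate assigns one scalar score per perspective, normalized with a softmax. The names `ensemble_gate`, `perspective_logits`, and `gate_scores` are hypothetical, as are the example values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_gate(perspective_logits, gate_scores):
    """Fuse per-perspective class logits with softmax-normalized gate weights.

    perspective_logits: one list of class logits per perspective (hypothetical).
    gate_scores: one scalar score per perspective (assumed; in a trained model
                 these would come from learned parameters).
    Returns the fused logits and the normalized weights.
    """
    if len(perspective_logits) != len(gate_scores):
        raise ValueError("need exactly one gate score per perspective")
    weights = softmax(gate_scores)
    n_classes = len(perspective_logits[0])
    fused = [
        sum(w * logits[c] for w, logits in zip(weights, perspective_logits))
        for c in range(n_classes)
    ]
    return fused, weights


# Hypothetical logits over [not sarcastic, sarcastic] from three perspectives.
perspective_logits = [[0.2, 0.8], [1.5, -0.5], [-0.3, 1.1]]
gate_scores = [0.5, -1.0, 1.0]  # made-up values for illustration only
fused, weights = ensemble_gate(perspective_logits, gate_scores)
prediction = max(range(len(fused)), key=lambda c: fused[c])
```

Under this assumption the gate acts as a soft vote: a perspective with a higher gate score contributes more to the fused prediction, and equal scores reduce to a plain average of the perspectives.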