S³ Agent: Unlocking the Power of VLLM for Zero-Shot Multi-Modal Sarcasm Detection

Peng Wang, Yongheng Zhang, Hao Fei, Qiguang Chen, Yukai Wang, Jiasheng Si, Wenpeng Lu, Min Li, Libo Qin

Published: 30 Nov 2025, Last Modified: 05 Dec 2025. ACM Transactions on Multimedia Computing, Communications, and Applications. License: CC BY-SA 4.0
Abstract: Multi-modal sarcasm detection involves determining whether a given multi-modal input conveys sarcastic intent by analyzing the underlying sentiment. Recently, vision large language models have shown remarkable success on a variety of multi-modal tasks. Inspired by this, we systematically investigate the impact of vision large language models on the zero-shot multi-modal sarcasm detection task. Furthermore, to capture different perspectives of sarcastic expressions, we propose a multi-view agent framework, S³ Agent, designed to enhance zero-shot multi-modal sarcasm detection by leveraging three critical perspectives: superficial expression, semantic information, and sentiment expression. Our experiments on the MMSD2.0 dataset, covering six models and four prompting strategies, demonstrate that our approach achieves state-of-the-art performance, with an average improvement of 13.2% in accuracy. Moreover, we evaluate our method on the text-only sarcasm detection task, where it also surpasses baseline approaches.
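The three-perspective framework described in the abstract can be sketched as a simple prompting pipeline: query a vision LLM once per perspective, then aggregate the views into a final sarcasm verdict. This is a minimal illustration only; the prompt wording, the `query_vlm` interface, and the aggregation step are assumptions, not the authors' exact method.

```python
# Hedged sketch of multi-view zero-shot sarcasm detection.
# The prompts and aggregation rule are illustrative assumptions.

PERSPECTIVES = {
    "superficial": "Describe the literal content of the image and text.",
    "semantic": "Summarize the underlying meaning of the image-text pair.",
    "sentiment": "Describe the sentiment conveyed by the image and text.",
}

def s3_agent(query_vlm, image, text):
    """Run three perspective queries, then one aggregation query.

    `query_vlm(prompt, image)` is a hypothetical callable wrapping any
    vision large language model; it must return a string.
    """
    views = {
        name: query_vlm(f"{instruction}\nText: {text}", image)
        for name, instruction in PERSPECTIVES.items()
    }
    evidence = "\n".join(f"[{name}] {view}" for name, view in views.items())
    verdict = query_vlm(
        "Given the three analyses below, answer 'yes' if the post is "
        f"sarcastic, otherwise 'no'.\n{evidence}\nText: {text}",
        image,
    )
    return verdict.strip().lower().startswith("yes")

# Demo with a stub model standing in for a real VLLM.
def stub_vlm(prompt, image):
    return "yes" if "sarcastic" in prompt else "analysis"

print(s3_agent(stub_vlm, image=None, text="What lovely weather..."))  # True
```

In a real deployment, `query_vlm` would wrap an actual vision LLM API call; the stub above exists only so the sketch runs end to end.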
External IDs:doi:10.1145/3690642