SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding

Published: 2025 · Last Modified: 15 Jan 2026 · IEEE Trans. Affect. Comput. 2025 · CC BY-SA 4.0
Abstract: In the era of large language models (LLMs), tasks associated with “System I” cognition—those that are fast, automatic, and intuitive, such as sentiment analysis and text classification—are often considered effectively solved. However, sarcasm remains a persistent challenge. As a subtle and complex linguistic phenomenon, sarcasm frequently involves rhetorical devices such as hyperbole and figurative language to express implicit sentiments and intentions, demanding a higher level of abstraction and pragmatic reasoning than standard sentiment analysis. This raises concerns about whether current claims of LLM success extend robustly to the domain of sarcasm understanding. To systematically investigate this issue, we introduce a new high-quality multi-modal sarcasm detection dataset, termed AMSD, and construct a comprehensive evaluation benchmark, SarcasmBench. Our benchmark encompasses 16 state-of-the-art (SOTA) LLMs and 8 strong pretrained language models (PLMs), evaluated across six widely used textual sarcasm datasets and three multi-modal sarcasm benchmarks. We adopt three popular prompting paradigms: zero-shot input/output (IO) prompting, few-shot IO prompting, and chain-of-thought (CoT) prompting. Our extensive experiments yield three key findings: (1) current LLMs underperform supervised PLM-based sarcasm detection baselines, suggesting that significant effort is still required to improve LLMs’ understanding of human sarcasm; (2) GPT-4 and Gemini 2.0 consistently and significantly outperform the other LLMs across the various prompting methods; (3) few-shot IO prompting outperforms the other two methods, zero-shot IO and few-shot CoT prompting. We hope this benchmark will serve as a valuable resource for the research community and inspire future work toward more robust and human-aligned sarcasm understanding.
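The three prompting paradigms named in the abstract can be illustrated with a minimal sketch. The templates, few-shot demonstrations, and label wording below are hypothetical and are not taken from SarcasmBench itself; they only show how zero-shot IO, few-shot IO, and CoT prompts for sarcasm detection typically differ in structure.

```python
# Illustrative prompt construction for the three paradigms.
# All wording, examples, and labels are assumptions, not the paper's templates.

FEW_SHOT_EXAMPLES = [
    ("Oh great, another Monday. Just what I needed.", "sarcastic"),
    ("The sunset over the lake was beautiful tonight.", "not sarcastic"),
]


def zero_shot_io_prompt(text: str) -> str:
    """Zero-shot IO: ask directly for the label, with no examples or reasoning."""
    return (
        "Decide whether the following text is sarcastic.\n"
        f"Text: {text}\n"
        "Answer with 'sarcastic' or 'not sarcastic':"
    )


def few_shot_io_prompt(text: str) -> str:
    """Few-shot IO: prepend labeled demonstrations before the query."""
    demos = "\n".join(
        f"Text: {t}\nAnswer: {label}" for t, label in FEW_SHOT_EXAMPLES
    )
    return (
        "Decide whether each text is sarcastic.\n"
        f"{demos}\n"
        f"Text: {text}\nAnswer:"
    )


def cot_prompt(text: str) -> str:
    """Chain-of-thought: ask the model to reason step by step before answering."""
    return (
        "Decide whether the following text is sarcastic. "
        "Think step by step about the literal meaning, the implied sentiment, "
        "and any hyperbole or figurative language, then give a final answer.\n"
        f"Text: {text}\n"
        "Reasoning:"
    )


if __name__ == "__main__":
    sample = "Wow, I love waiting two hours for a five-minute appointment."
    print(zero_shot_io_prompt(sample))
```

In practice, each constructed prompt would be sent to an LLM and the predicted label parsed from the response; the benchmark's finding (3) compares exactly these three prompt structures.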