Difficulty-Based Training Strategy with MLLMs for Multimodal Sarcasm Explanation

ACL ARR 2024 June Submission 2210 Authors

15 Jun 2024 (modified: 07 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Multimodal Sarcasm Explanation (MuSE) is a new yet challenging task that aims to generate natural language explanations for sarcasm in social media image-text pairs. MuSE can further enhance sarcasm understanding and has attracted increasing research interest. Previous works design cross-modal attention mechanisms or multi-source semantic graphs and achieve promising performance. However, they either ignore the semantic gap between visual features and the textual decoder or introduce complex graph constructions, which limits their practical applicability and scalability in real-world scenarios. Furthermore, they treat every sample equally during training, overlooking the different effects of samples at different levels of difficulty. In this paper, we propose a novel Multi-Dimensional Sample Difficulty (MDSD) based training strategy with Multimodal Large Language Models (MLLMs) for MuSE. Leveraging the multidimensional difficulty of image-text pairs, we let MLLMs learn from easy to hard samples during training, mitigating the impact of samples of varying difficulty and reducing the risk of converging to poor local optima. Building on the alignment and innate knowledge of MLLMs, we achieve better cross-modal alignment without complicated procedures. Experimental results with two open-source MLLMs on the publicly released MORE dataset demonstrate that MDSD further enhances MLLMs and achieves state-of-the-art performance.
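The easy-to-hard ordering described in the abstract can be illustrated with a minimal sketch that scores each training sample along several difficulty dimensions and sorts by a weighted combination. The dimension names, weights, and scoring function below are hypothetical placeholders for illustration; the paper's actual difficulty measures are not specified in this abstract.

```python
# Minimal sketch of an easy-to-hard (curriculum-style) ordering of training samples.
# Difficulty dimensions and weights are assumed for illustration only.
from typing import Dict, List


def combined_difficulty(sample: Dict, weights: Dict[str, float]) -> float:
    """Aggregate per-dimension difficulty scores into a single scalar."""
    return sum(weights[d] * sample["difficulty"][d] for d in weights)


def curriculum_order(samples: List[Dict], weights: Dict[str, float]) -> List[Dict]:
    """Sort training samples from easiest to hardest."""
    return sorted(samples, key=lambda s: combined_difficulty(s, weights))


# Example usage with two hypothetical difficulty dimensions per image-text pair.
weights = {"text_difficulty": 0.5, "image_text_gap": 0.5}
samples = [
    {"id": 1, "difficulty": {"text_difficulty": 0.2, "image_text_gap": 0.7}},
    {"id": 2, "difficulty": {"text_difficulty": 0.1, "image_text_gap": 0.3}},
]
for s in curriculum_order(samples, weights):
    print(s["id"], combined_difficulty(s, weights))
```

In practice, the sorted (or difficulty-bucketed) samples would simply replace random shuffling when feeding batches to the MLLM during fine-tuning.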
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multi-dimensional sample difficulty, multimodal large language models, multimodal sarcasm explanation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2210