MoBA: Mixture of Bi-directional Adapter for Multi-modal Sarcasm Detection

Published: 20 Jul 2024, Last Modified: 01 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: In the field of multi-modal learning, model parameters are typically large, necessitating parameter-efficient fine-tuning (PEFT) techniques. These methods have been pivotal in improving training efficiency for downstream tasks across a wide range of settings. However, directly applying PEFT methods fails to fully address the intricate demands of multi-modal tasks such as multi-modal sarcasm detection (MSD), which requires extracting and comparing cues from different modalities. MSD, particularly when reliant on textual and visual modalities, faces challenges in identifying the incongruity that signals sarcasm. This issue often arises from the lack of inter-modality interaction during tuning, resulting in a disconnect between textual and visual information. In this paper, we introduce a novel approach, the Mixture of Bi-directional Adapters (MoBA), designed to minimize training parameters while enhancing the model's ability to interpret sarcasm across modalities. By facilitating an exchange between textual and visual information through a low-rank representation, our method captures the nuances of sarcastic expressions with a reduced number of training parameters. Our empirical studies, carried out on two publicly accessible and emerging datasets, demonstrate that our model substantially improves sarcasm detection accuracy. These findings indicate that our approach provides a more reliable and efficient solution to the complexities of MSD.
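The abstract describes exchanging textual and visual information through a low-rank bottleneck. The paper itself does not specify the exact computation, but the general idea can be sketched as follows, with all dimensions, weight names, and the pooling/residual scheme being illustrative assumptions rather than the authors' actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): d = hidden size, r = low rank.
d, r = 64, 8

# Frozen backbone features for one sample: text tokens and image patches.
text_feats = rng.standard_normal((16, d))  # 16 text tokens
img_feats = rng.standard_normal((49, d))   # 49 image patches

# Low-rank adapter weights: the only trainable parameters in this sketch.
A_t, B_t = 0.1 * rng.standard_normal((d, r)), 0.1 * rng.standard_normal((r, d))
A_v, B_v = 0.1 * rng.standard_normal((d, r)), 0.1 * rng.standard_normal((r, d))

def bidirectional_adapter(text, img):
    """Exchange information between modalities via rank-r bottlenecks."""
    # Down-project each modality into the shared low-rank space.
    z_t = text @ A_t   # (16, r)
    z_v = img @ A_v    # (49, r)
    # Cross the streams: each modality is updated from the *other* one,
    # pooled over its tokens, projected back up, and added residually.
    text_out = text + z_v.mean(axis=0, keepdims=True) @ B_t
    img_out = img + z_t.mean(axis=0, keepdims=True) @ B_v
    return text_out, img_out

t_out, v_out = bidirectional_adapter(text_feats, img_feats)
print(t_out.shape, v_out.shape)  # shapes are preserved: (16, 64) (49, 64)
```

The parameter-efficiency argument is visible in the counts: each adapter trains 2·d·r weights per direction (about 1,024 here) instead of the d·d (4,096) a full cross-modal projection would require, and the frozen backbone features are never updated.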
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes to multi-modal learning by enhancing the interaction between data from multiple modalities using a low-rank approach. It focuses on efficiently handling and analyzing multi-modal sarcasm detection datasets that incorporate both textual and visual elements.
Submission Number: 896