Contrastive Learning for Multimodal Classification of Crisis-Related Tweets

Published: 01 Jan 2024, Last Modified: 30 Sept 2024, WWW 2024, CC BY-SA 4.0
Abstract: Multimodal tasks require learning a joint representation of the constituent modalities of the data. Contrastive learning builds such a joint representation using a contrastive loss. For example, CLIP takes image-caption pairs as input and is trained to maximize the similarity between an image and its corresponding caption in actual image-caption pairs, while minimizing the similarity for arbitrary image-caption pairs. This approach operates on the premise that the caption depicts the image's content. However, this assumption does not always hold for tweets that contain both text and images: previous studies have shown that the relationship between the image and the text in a tweet is more intricate. We study the effectiveness of pre-trained multimodal contrastive learning models, specifically CLIP and ALIGN, on the task of classifying multimodal crisis-related tweets. Our experiments on two publicly available datasets, CrisisMMD and DMD, show that despite the intricate image-text relationships in tweets, pre-trained contrastive learning models fine-tuned with task-specific data outperform prior approaches for the multimodal classification of crisis-related tweets. Additionally, the experiments show that the contrastive learning models are effective in low-data few-shot and cross-domain settings.
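The abstract describes the CLIP-style objective of maximizing similarity for matched image-caption pairs while minimizing it for mismatched pairs. The snippet below is a minimal sketch of that symmetric contrastive (InfoNCE-style) loss, not the paper's actual implementation; the function name, the temperature value, and the random embeddings standing in for encoder outputs are illustrative assumptions.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
# Assumes batched image and text embeddings from two encoders; matched
# pairs share the same row index, all other pairings act as negatives.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The correct caption for image i is text i (and vice versa).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Random embeddings as placeholders for encoder outputs (hypothetical).
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```

In a fine-tuning setting like the one the paper studies, the same loss (or a task-specific classification head on top of the pre-trained image and text encoders) would be optimized on tweet text-image pairs rather than on web-scale image-caption data.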