CLIP-PubOp: A CLIP-based Multimodal Representation Fusion Method for Public Opinion

Published: 01 Jan 2023 · Last Modified: 21 Feb 2025 · IEEE Big Data 2023 · CC BY-SA 4.0
Abstract: Vision-Language Pre-training (VLP) has made significant progress on general multimodal tasks in recent years. General-purpose multimodal datasets (such as MSCOCO and Flickr30k) have become standard benchmarks for VLP models, which rely on images and their corresponding captions for representation modeling. However, a public opinion event includes not only image captions but also other texts, such as content texts and comments, which may have a positive effect on image-text representations of public opinion. In this paper, we propose a method to exploit the positive effect of content texts on the initial multimodal representations. We name our model CLIP-PubOp; it is based on CLIP, a well-known VLP model trained with contrastive learning. We add a linear fusion layer before the fusion of image and text representations, which mixes event content-text representations and caption representations in a fixed proportion to obtain enhanced text representations. Multimodal representations are then obtained by fusing the enhanced text representations with the image representations. We crawl public opinion events from four mainstream online categories to build our datasets and conduct experiments on both the Chinese and English versions. The experimental results show that the content texts of public opinion events have a significant positive effect on multimodal representations, improving average accuracy by about 2%-10% on image-text retrieval tasks.
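As a rough illustration of the fusion described in the abstract, the sketch below mixes caption and content-text embeddings with a fixed weight before computing CLIP-style image-text similarity. The weight alpha, the temperature, and the function names are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import torch
import torch.nn.functional as F


def fuse_text_representations(caption_emb: torch.Tensor,
                              content_emb: torch.Tensor,
                              alpha: float = 0.7) -> torch.Tensor:
    """Linearly mix caption and event content-text embeddings, then re-normalize.

    alpha is an assumed mixing proportion; the paper learns/sets its own value.
    """
    fused = alpha * caption_emb + (1.0 - alpha) * content_emb
    return F.normalize(fused, dim=-1)


def image_text_logits(image_emb: torch.Tensor,
                      fused_text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style cosine-similarity logits between images and enhanced texts."""
    image_emb = F.normalize(image_emb, dim=-1)
    return image_emb @ fused_text_emb.t() / temperature


# Toy usage with random stand-ins for CLIP encoder outputs (batch of 4, dim 512).
caption_emb = torch.randn(4, 512)
content_emb = torch.randn(4, 512)
image_emb = torch.randn(4, 512)

fused = fuse_text_representations(caption_emb, content_emb, alpha=0.7)
logits = image_text_logits(image_emb, fused)  # shape: (4, 4)
print(logits.shape)
```

In this sketch, the enhanced text embedding simply replaces the caption embedding in the standard CLIP similarity computation; how the proportion is chosen or whether it is learned is left to the paper.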