Double-Fine-Tuning Multi-Objective Vision-and-Language Transformer for Social Media Popularity Prediction

Xiaolu Chen, Weilong Chen, Chenghao Huang, Zhongjian Zhang, Lixin Duan, Yanru Zhang

Published: 2023, Last Modified: 15 Apr 2024, ACM Multimedia 2023, Readers: Everyone

Abstract: Social media popularity prediction aims to predict the future interaction with, or attractiveness of, new posts. However, most existing works do not treat numerical features effectively. Although these features can provide ample information, they are often inadequately processed, so much of that information is lost. In this paper, we introduce a method named Double-Fine-Tuning Multi-Objective Vision-and-Language Transformer (DFT-MOVLT). To supplement the information available in vision-and-language pre-training (VLP), we propose compound text, which is formed by concatenating numerical data with text. Furthermore, during VLP, a transformer is trained with 3 objectives to ensure thorough feature extraction. Finally, for more generalized prediction, we fine-tune 2 models using different training strategies and ensemble them. To evaluate the effectiveness of each mechanism adopted in the proposed method, we conduct an array of ablation experiments. Our team achieved 3rd place in the Social Media Prediction (SMP) Challenge 2023.
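
As a rough illustration of two ideas named in the abstract, compound text and a two-model ensemble, the sketch below may help. The abstract does not specify how numerical data is serialized, which separator is used, or how the two fine-tuned models are combined, so the `name: value` format, the `[SEP]` separator, the simple mean, and the helper names `build_compound_text` and `ensemble_mean` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Mapping, Sequence


def build_compound_text(text: str,
                        numeric: Mapping[str, float],
                        sep: str = " [SEP] ") -> str:
    """Append numerical features to a post's text as plain tokens.

    Assumption: features are written as 'name: value' pairs; the paper's
    actual serialization may differ.
    """
    if not numeric:
        return text
    parts = [f"{name}: {value:g}" for name, value in numeric.items()]
    return text + sep + ", ".join(parts)


def ensemble_mean(pred_a: Sequence[float],
                  pred_b: Sequence[float]) -> List[float]:
    """Average the predictions of two models element-wise.

    Assumption: an unweighted mean; the abstract only says the two
    fine-tuned models are ensembled.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


if __name__ == "__main__":
    # Hypothetical post and feature names, for illustration only.
    print(build_compound_text("sunset at the beach",
                              {"followers": 1200, "tags": 3}))
    print(ensemble_mean([1.0, 3.0], [3.0, 5.0]))
```

Serializing numbers into the text stream lets a text encoder see them alongside the caption without a separate numeric branch, which is one plausible reading of how compound text supplements VLP.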
