Compact bilinear pooling and multi-loss network for social media multimodal classification
Abstract: Social media platforms have seen an influx of multimodal data, leading to heightened attention on image-text multimodal classification. Existing methods for multimodal classification primarily focus on fusing features from different modalities. However, owing to the heterogeneity and high dimensionality of multimodal data, the fusion process frequently introduces redundant information and noise, limiting accuracy and generalization. To address this limitation, we propose a Compact Bilinear pooling and Multi-Loss network (CBMLNet). Compact bilinear pooling is used for feature fusion to learn low-dimensional yet expressive multimodal representations efficiently. Furthermore, a multi-loss function is proposed to incorporate the specific information carried by each individual modality. CBMLNet therefore simultaneously considers cross-modal correlations and single-modality specificity for image-text classification. We evaluate the proposed CBMLNet on two publicly available datasets, Twitter-15 and Twitter-17, and on a private dataset, AIFUN. CBMLNet is compared with advanced methods, including multimodal BERT with max pooling, the Multi-Interactive Memory Network, the Multi-level Multi-modal Cross-attention Network, the Image-Text Correlation model (ITC), target-oriented multimodal BERT, and the multimodal hierarchical attention model (MHA). Experimental results demonstrate that CBMLNet improves the F1-score by an average of 0.28% over the best fine-grained baseline (MHA) and by 0.44% over the best coarse-grained baseline (ITC). This illustrates that CBMLNet is practical for real-world fuzzy applications as a coarse-grained model.
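Compact bilinear pooling is commonly implemented with the Tensor Sketch trick: each modality's feature vector is projected with a Count Sketch, and the sketch of the outer product is recovered as a circular convolution computed via the FFT. The paper's exact architecture is not reproduced here; the following is a minimal NumPy sketch under assumed, illustrative dimensions (a 2048-d image embedding, a 768-d text embedding, and an 8000-d fused output).

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch_params(in_dim, out_dim, rng):
    # Fixed random hash h: each input index maps to one output bucket,
    # paired with a random sign s in {-1, +1}.
    h = rng.integers(0, out_dim, size=in_dim)
    s = rng.choice([-1.0, 1.0], size=in_dim)
    return h, s

def count_sketch(x, h, s, out_dim):
    # Count Sketch projection: out[h[i]] += s[i] * x[i].
    out = np.zeros(out_dim)
    np.add.at(out, h, s * x)
    return out

def compact_bilinear_pool(x, y, out_dim, params_x, params_y):
    # Tensor Sketch: the sketch of the outer product x (x) y equals the
    # circular convolution of the two individual sketches, computed
    # cheaply in the frequency domain with the FFT.
    sx = count_sketch(x, *params_x, out_dim)
    sy = count_sketch(y, *params_y, out_dim)
    return np.real(np.fft.ifft(np.fft.fft(sx) * np.fft.fft(sy)))

# Illustrative usage: fuse an image feature and a text feature into a
# single low-dimensional multimodal representation.
img_feat = rng.standard_normal(2048)  # e.g. a CNN image embedding
txt_feat = rng.standard_normal(768)   # e.g. a BERT [CLS] embedding
d = 8000
px = count_sketch_params(2048, d, rng)
py = count_sketch_params(768, d, rng)
fused = compact_bilinear_pool(img_feat, txt_feat, d, px, py)
print(fused.shape)  # (8000,)
```

The appeal of this fusion is that it approximates the full bilinear (outer-product) interaction between modalities while producing a d-dimensional vector instead of a 2048 x 768 matrix, which is what keeps the multimodal representation low-dimensional and efficient.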
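The abstract does not specify how the multi-loss function combines its terms. A common design, shown below purely as a hypothetical sketch, attaches auxiliary classification heads to the image-only and text-only branches and adds their weighted cross-entropy losses to the loss on the fused representation; the weights alpha and beta and the function name are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def multimodal_loss(fused_logits, img_logits, txt_logits, labels,
                    alpha=0.5, beta=0.5):
    # Main supervision on the fused multimodal prediction.
    loss_fused = F.cross_entropy(fused_logits, labels)
    # Auxiliary unimodal losses (assumed weighting) keep each branch
    # discriminative, injecting modality-specific information.
    loss_img = F.cross_entropy(img_logits, labels)
    loss_txt = F.cross_entropy(txt_logits, labels)
    return loss_fused + alpha * loss_img + beta * loss_txt

# Illustrative usage with a batch of 4 samples and 3 classes:
labels = torch.tensor([0, 2, 1, 0])
loss = multimodal_loss(torch.randn(4, 3), torch.randn(4, 3),
                       torch.randn(4, 3), labels)
```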