Pre-Training and Fine-Tuning with Next Sentence Prediction for Multimodal Entity Linking

Lu Li, Qipeng Wang, Baohua Zhao, Xinwei Li, Aihua Zhou, Hanqian Wu

Published: 01 Jul 2022, Last Modified: 16 Mar 2026, Electronics, CC BY-SA 4.0

Abstract: Multimodal entity linking (MEL) is an emerging research field that is attracting growing attention. However, previous work has focused mainly on obtaining joint representations of mentions and entities and then using these representations to determine the relationship between them. As a result, such models are often very complex and tend to ignore the relationships between information of different modalities drawn from different corpora. To address these problems, we propose a pre-training and fine-tuning paradigm for MEL. We design three categories of next sentence prediction (NSP) tasks for pre-training, namely mixed-modal, text-only, and multimodal, and double the amount of pre-training data by swapping the roles of the two sentences in NSP. Experimental results show that our model outperforms the baseline models and that each of our pre-training strategies contributes to the improvement. In addition, pre-training gives the final model strong generalization capability, so that it performs well even with smaller amounts of data.
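The role-swapping step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `NSPExample` tuple layout, the `swap_augment` helper, and the placeholder strings are all hypothetical, and the sketch assumes that a swapped pair keeps the label of the original pair, which the abstract does not state.

```python
from typing import List, Tuple

# Hypothetical layout: (sentence_a, sentence_b, label).
# label 1 = the two sentences belong together, 0 = they do not.
NSPExample = Tuple[str, str, int]


def swap_augment(examples: List[NSPExample]) -> List[NSPExample]:
    """Return the original examples followed by a role-swapped copy of each.

    Assumption: swapping sentence A and sentence B leaves the label unchanged.
    """
    swapped = [(b, a, label) for a, b, label in examples]
    return examples + swapped


# Placeholder text standing in for mention contexts and entity descriptions.
examples: List[NSPExample] = [
    ("mention context", "matching entity description", 1),
    ("mention context", "unrelated entity description", 0),
]
augmented = swap_augment(examples)
```

Under these assumptions, `augmented` holds twice as many pairs as `examples`, which corresponds to the doubling of pre-training data described above.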

External IDs: doi:10.3390/electronics11142134