Abstract: Multimodal Entity Linking (MEL) aims to resolve the ambiguity of multimodal mentions by associating them with entities in Multimodal Knowledge Graphs (MMKGs). Existing works primarily focus on designing multimodal interaction and fusion mechanisms to enhance MEL performance. However, these methods still overlook two crucial gaps within the MEL task. One is the content discrepancy between mentions and entities, manifested as uneven information density. The other is the knowledge gap, indicating insufficient knowledge extraction and reasoning during the linking process. To bridge these gaps, we propose a novel framework, FissFuse, as well as a plug-and-play knowledge-aware re-ranking method, KAR. Specifically, FissFuse couples collaborative Fission and Fusion branches, establishing dynamic features for each mention-entity pair and adaptively learning multimodal interactions to alleviate the content discrepancy. Meanwhile, KAR is endowed with carefully crafted instructions for intricate knowledge reasoning, serving as a re-ranking agent empowered by Large Language Models (LLMs). Extensive experiments on two well-constructed MEL datasets demonstrate the outstanding performance of FissFuse compared with various baselines. Comprehensive evaluations and ablation experiments validate the effectiveness and generality of KAR.
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work introduces a new framework for multimodal entity linking, which employs collaborative fusion and fission branches to resolve the ambiguity of entities within multimodal contexts (textual and visual data) and associates entities with multimodal knowledge graphs. This approach advances the field of multimodal processing by aligning multimodal data with external knowledge and further benefits knowledge-intensive multimodal tasks.
Submission Number: 5387