Beyond Embedding Fusion: LLM-Driven Structural Enhancement for Multimodal Knowledge Graph Completion
Keywords: Knowledge graph, Multimodal, Large Language Model
Abstract: Multimodal knowledge graph completion (MKGC) aims to improve structural reasoning by incorporating visual and textual information.
However, existing approaches rely heavily on embedding fusion, where multimodal features are compressed and fused into a unified vector before structural prediction. This compression-then-fusion paradigm inevitably discards much of the rich semantics carried by the raw modalities, treating them as auxiliary cues rather than as sources of explicit structural knowledge. As a result, current MKGC methods often fail to capture the deeper relational semantics implied in texts and images. To address this limitation, we propose LLM-SE (Large Language Model–driven Structural Enhancement), a generate-then-disentangle framework that transforms raw multimodal signals into explicit structural triplets instead of collapsing them into unified embeddings. LLM-SE comprises two main modules: (1) Multimodal Triplet Generation, which performs the generation step by leveraging large multimodal models to extract meaningful triplets from texts and images; and (2) a Dual-View Complex module, a disentanglement mechanism that separates original triplets from LLM-generated deep triplets, enabling the model to adaptively capture both stable and exploratory knowledge. Extensive experiments on multiple MKGC benchmarks show that LLM-SE consistently outperforms state-of-the-art models across all metrics.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Generation, Information Extraction, Machine Learning for NLP
Languages Studied: English
Submission Number: 1607