Abstract: Multimodal knowledge graphs (MMKGs) have gained widespread adoption across various domains. However, existing transformer-based methods for MMKG representation learning primarily focus on enhancing representation performance while overlooking time and memory costs, which limits model efficiency. To tackle these limitations, we introduce a multimodal lightweight transformer (MLFormer) model, which not only ensures robust representation capabilities but also considerably improves computational efficiency. We find that the self-attention mechanism in transformers introduces substantial computational overhead. We therefore optimize the traditional multimodal knowledge graph embedding (MMKGE) model in two aspects, modality processing and modality fusion, by incorporating a filter gate and the Fourier transform. Our experimental results on real-world multimodal knowledge graph completion datasets demonstrate that MLFormer achieves significant improvements in computational efficiency while maintaining competitive performance.
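The abstract does not give implementation details, but the core idea of replacing quadratic-cost self-attention with a Fourier transform plus a learnable filter gate can be illustrated with a minimal PyTorch sketch. All names, shapes, and the specific gating scheme below are assumptions (loosely in the spirit of FNet/GFNet-style spectral mixing), not the authors' MLFormer code:

```python
import torch
import torch.nn as nn


class FourierFilterMixer(nn.Module):
    """Hypothetical token-mixing layer: an FFT over the token axis followed by
    an element-wise learnable spectral filter ("filter gate"). This is a sketch
    of the general technique, not the MLFormer layer itself."""

    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        # Learnable complex-valued filter over the non-redundant frequency grid.
        self.filter = nn.Parameter(torch.randn(seq_len // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim). Mixing tokens in the frequency domain costs
        # O(n log n) instead of the O(n^2) of self-attention.
        freq = torch.fft.rfft(x, dim=1, norm="ortho")        # (batch, seq_len//2+1, dim)
        gate = torch.view_as_complex(self.filter)             # (seq_len//2+1, dim)
        freq = freq * gate                                     # element-wise filter gate
        return torch.fft.irfft(freq, n=x.size(1), dim=1, norm="ortho")


if __name__ == "__main__":
    layer = FourierFilterMixer(seq_len=16, dim=64)
    tokens = torch.randn(2, 16, 64)   # e.g. fused entity/image/text token embeddings
    print(layer(tokens).shape)        # torch.Size([2, 16, 64])
```

Because the mixing is a fixed-size linear operation in the frequency domain, such a layer avoids both the quadratic attention matrix and its associated memory footprint, which is consistent with the efficiency gains the abstract claims.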