Coarse-to-fine multimodal prototype network for few-shot multimodal relation extraction

Published: 2025 · Last Modified: 21 Jan 2026 · Knowl. Based Syst. 2025 · CC BY-SA 4.0
Abstract: Multimodal relation extraction (MRE) aims to identify relations between textual entities in free text with the aid of accompanying images. Most MRE methods rely heavily on large numbers of manually annotated samples, and their performance drops sharply when labeled data is insufficient. Yet annotating data in many expert domains is knowledge-intensive and time-consuming, demanding significant labeling effort. Inspired by the success of meta-learning, several studies have applied few-shot learning to the MRE task to reduce this effort. Nevertheless, existing few-shot multimodal relation extraction methods typically rely on shallow features from the text and image modalities, neglecting the latent relation-label-aware cues in multimodal data. Moreover, they struggle to capture fine-grained multimodal interactions aligned with entity semantics. These limitations prevent the models from focusing on the most informative parts of text-image pairs and restrict their capability to reason in complex multimodal scenarios. To overcome these shortcomings, we propose a Coarse-to-fine Multimodal Prototype Network (CMPNet) that learns hierarchical multimodal features for the few-shot multimodal relation extraction task. Specifically, our model captures multimodal features through a gated cross-attention module and obtains semantic-aware representations with two guided modules. One module, guided by relation-label semantics, aligns prototypes with the semantic characteristics of relations. The other captures entity-related multimodal features, guiding the model to concentrate on textual and visual information closely tied to the entities. Extensive experiments on two benchmark datasets demonstrate that CMPNet outperforms previous baseline models under different few-shot settings, confirming the effectiveness of our model.
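To make the high-level description above concrete, the sketch below illustrates one plausible reading of the two core ingredients mentioned in the abstract: gated cross-attention fusion of text and image features, and prototype-based few-shot classification. It is a minimal illustration under assumed conventions (standard scaled dot-product cross-attention, a sigmoid gate, mean-pooled prototypes); all module names, dimensions, and wiring are hypothetical and are not taken from the authors' implementation.

```python
# Hypothetical sketch: gated cross-attention fusion + prototype classification.
# All names, shapes, and design choices are assumptions for illustration only.
import torch
import torch.nn as nn


class GatedCrossAttentionFusion(nn.Module):
    """Fuse text tokens with image patches via cross-attention and a learned gate."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (B, Lt, D) token features; image: (B, Lv, D) patch features
        attended, _ = self.cross_attn(query=text, key=image, value=image)
        g = torch.sigmoid(self.gate(torch.cat([text, attended], dim=-1)))
        # Gated residual fusion: the gate decides how much visual evidence to admit.
        return g * attended + (1 - g) * text


def class_prototypes(support: torch.Tensor, labels: torch.Tensor, n_way: int) -> torch.Tensor:
    """Average support embeddings per relation class (ProtoNet-style prototypes)."""
    # support: (N*K, D) pooled multimodal embeddings; labels: (N*K,) in [0, n_way)
    return torch.stack([support[labels == c].mean(dim=0) for c in range(n_way)])


def classify(query: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """Assign each query to the nearest prototype via negative Euclidean distance."""
    return (-torch.cdist(query, protos)).softmax(dim=-1)  # (Q, n_way)


if __name__ == "__main__":
    fusion = GatedCrossAttentionFusion(dim=32, num_heads=4)
    text = torch.randn(5, 10, 32)   # 5 support instances, 10 text tokens each
    image = torch.randn(5, 49, 32)  # 5 images, 49 patches each
    fused = fusion(text, image).mean(dim=1)      # pool to (5, 32)
    labels = torch.tensor([0, 0, 1, 1, 1])       # a 2-way support set
    protos = class_prototypes(fused, labels, n_way=2)
    query = torch.randn(3, 32)
    print(classify(query, protos).shape)         # torch.Size([3, 2])
```

In a prototypical-network setting such as this, the episode-level classifier has no class-specific parameters: support instances define the prototypes, and queries are scored by distance, which is what makes the approach suitable when only a few labeled examples per relation are available.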