Abstract: With the rise of multimedia-driven content on the internet, multimodal relation extraction has gained significant importance in domains such as intelligent search and multimodal knowledge graph construction. Social media, as a rich source of image-text data, plays a crucial role in populating knowledge bases, but the noisy information in social media data makes multimodal relation extraction challenging. Current methods focus on extracting relevant information from images to improve model performance but often overlook global image information. In this paper, we propose a novel multimodal relation extraction method, named FocalMRE, which leverages image focal augmentation, focal attention, and gating mechanisms. FocalMRE enables the model to concentrate on the focal regions of an image while still exploiting its global information. Through its gating mechanisms, FocalMRE optimizes the multimodal fusion strategy, allowing the model to select the augmented regions most relevant to relation extraction and thereby overcome noise interference. Experimental results on the public MNRE dataset show that the proposed method achieves robust and significant performance advantages in multimodal relation extraction, especially in scenarios with high noise, long-tail distributions, and limited resources.
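The abstract does not spell out the gating mechanism, so as a rough, hypothetical illustration only: a common way to gate multimodal fusion is a learned per-dimension sigmoid gate that weighs image evidence against text evidence. The feature dimension, the single-layer gate, and the convex-combination form below are all assumptions for illustration, not the authors' actual FocalMRE design.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_feat, image_feat, W, b):
    """Fuse text and image features with a per-dimension sigmoid gate.

    g = sigmoid(W @ [text; image] + b) decides, dimension by dimension,
    how much (possibly noisy) image evidence to admit versus text evidence.
    """
    g = sigmoid(W @ np.concatenate([text_feat, image_feat]) + b)
    return g * image_feat + (1.0 - g) * text_feat  # convex combination

d = 8  # hypothetical feature dimension
W = rng.normal(scale=0.1, size=(d, 2 * d))  # stand-in for learned weights
b = np.zeros(d)

text_feat = rng.normal(size=d)
image_feat = rng.normal(size=d)
fused = gated_fusion(text_feat, image_feat, W, b)
print(fused.shape)  # (8,)
```

Because each gate value lies in (0, 1), a dimension dominated by image noise can be suppressed toward the text feature, which is the intuition the abstract appeals to when it describes selecting the most relevant augmented regions.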
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia/multimodal processing by proposing a novel method for multimodal relation extraction that addresses the noisy information present in social media data. The proposed method uses image focal augmentation, focal attention, and gating mechanisms to focus the model on the focal regions of the image that are relevant to the text, while making full use of the image's global information, yielding more accurate relation extraction. It further optimizes the multimodal fusion of image and text by augmenting focal regions in the image and using gating mechanisms to select the regions most relevant to relation extraction. These innovations contribute to the development of more effective multimodal processing techniques.
Submission Number: 2135