Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Published: 23 Jan 2024, Last Modified: 23 May 2024 · TheWebConf 2024
Keywords: Multimodal Relation Extraction, Multimodal Fusion
Abstract: Multimodal relation extraction is a fundamental task in multimodal information extraction. Recent studies have shown promising results by integrating hierarchical visual features, ranging from local regions, such as image patches, to broader global regions that constitute the entire image, into textual representations. However, research to date has largely overlooked how hierarchical visual semantics are represented and which of their characteristics benefit relation extraction. To bridge this gap, we propose a novel two-stage hierarchical visual context fusion transformer that incorporates a mixture-of-multimodal-experts framework to effectively represent hierarchical visual features and integrate them into textual semantic representations. In addition, we introduce the concept of hierarchical tracking maps to shed light on the intrinsic mechanisms by which multimodal models process image information. We thoroughly investigate the implications of hierarchical visual contexts along four dimensions: performance evaluation, the nature of auxiliary visual information, the patterns observed in the image encoding hierarchy, and the significance of the various visual encoding levels. Empirical studies show that our approach achieves new state-of-the-art performance on the MNRE dataset.
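To make the fusion idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a mixture-of-experts layer over hierarchical visual features. All module names, dimensions, and the cross-attention fusion strategy are assumptions for illustration only, not the authors' actual architecture: one "expert" projection per hierarchy level (e.g. patch, object region, whole image), a gate scored from the pooled text representation, and a residual fusion into the text tokens.

```python
# Hypothetical sketch: mixture of hierarchical visual context learners.
# Not the paper's implementation; names and design choices are assumed.
import torch
import torch.nn as nn


class HierarchicalVisualMoE(nn.Module):
    """Fuses visual features from several hierarchy levels (e.g. patches,
    object regions, the global image) into text token representations,
    weighting each level with a learned gate (one expert per level)."""

    def __init__(self, d_model: int = 768, num_levels: int = 3):
        super().__init__()
        # One projection "expert" per visual hierarchy level.
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_levels)]
        )
        # Gate scores each level from the mean-pooled text representation.
        self.gate = nn.Linear(d_model, num_levels)
        self.cross_attn = nn.MultiheadAttention(
            d_model, num_heads=8, batch_first=True
        )

    def forward(
        self, text: torch.Tensor, visual_levels: list[torch.Tensor]
    ) -> torch.Tensor:
        # text: (batch, seq_len, d_model)
        # visual_levels: one (batch, n_i, d_model) tensor per hierarchy level
        weights = torch.softmax(self.gate(text.mean(dim=1)), dim=-1)
        fused = torch.zeros_like(text)
        for i, (expert, feats) in enumerate(zip(self.experts, visual_levels)):
            proj = expert(feats)
            # Text tokens attend to this level's projected visual features.
            ctx, _ = self.cross_attn(text, proj, proj)
            fused = fused + weights[:, i].view(-1, 1, 1) * ctx
        return text + fused  # residual fusion into textual semantics


# Usage with dummy tensors: 3 levels (49 patches, 10 regions, 1 global).
model = HierarchicalVisualMoE()
text = torch.randn(2, 32, 768)
levels = [torch.randn(2, n, 768) for n in (49, 10, 1)]
print(model(text, levels).shape)  # torch.Size([2, 32, 768])
```

Gating from the text side reflects the intuition that how much each visual granularity should contribute depends on the entities mentioned in the sentence; the residual connection keeps the textual semantics dominant when visual context is uninformative.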
Track: Web Mining and Content Analysis
Submission Guidelines Scope: Yes
Submission Guidelines Blind: Yes
Submission Guidelines Format: Yes
Submission Guidelines Limit: Yes
Submission Guidelines Authorship: Yes
Student Author: Yes
Submission Number: 1741