Cross-modal event extraction via Visual Event Grounding and Semantic Relation Filling

Published: 01 Jan 2025 · Last Modified: 19 May 2025 · Inf. Process. Manag. 2025 · CC BY-SA 4.0
Abstract: Cross-modal event extraction aims to retrieve salient information from multiple modalities and capture all arguments associated with a specific event. However, existing methods often rely on weak alignment and generative data augmentation techniques, which inadequately address the inherent out-of-focus challenge posed by the complexity of social media images. These methods also frequently neglect the preservation of semantic relationships between entities in cross-modal event extraction tasks. In this paper, we propose the Visual Event Grounding and Semantic Relation Filling (VEGSRF) approach to tackle these challenges. To address the out-of-focus issue, the Visual Event Grounding (VEG) module employs textual event detection to prioritize relevant events within visual content. Meanwhile, the Semantic Relation Filling (SRF) module uses the event schema as prompt information to guide template filling, capturing the meta-relationships between image entities. To rigorously evaluate the VEGSRF approach, we construct the Chinese Multimodal Multi-Event (CMMEvent) dataset, which includes 13 event types and 34 sub-types. Extensive experiments on the M2E2 and CMMEvent datasets show F1-score improvements of 6.7% and 0.7%, respectively.
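To make the schema-as-prompt idea concrete, below is a minimal, hypothetical sketch of schema-guided template filling in the spirit of the SRF module. The schema contents, role names, and the `build_prompt`/`fill_template` helpers are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: an event schema drives both prompt construction and
# template filling, so every argument role is explicitly accounted for.

EVENT_SCHEMA = {
    # event sub-type -> argument roles the template must fill (assumed roles)
    "Conflict.Attack": ["Attacker", "Target", "Instrument", "Place"],
    "Movement.Transport": ["Agent", "Artifact", "Vehicle", "Destination"],
}

def build_prompt(event_type: str) -> str:
    """Turn an event schema into a slot-style prompt, one slot per role."""
    roles = EVENT_SCHEMA[event_type]
    slots = ", ".join(f"{role}: [{role.upper()}]" for role in roles)
    return f"Event <{event_type}> with arguments -- {slots}"

def fill_template(event_type: str, grounded_entities: dict) -> dict:
    """Map visually grounded entities onto schema roles; unmatched roles
    stay None so downstream scoring can treat them as absent arguments."""
    return {role: grounded_entities.get(role)
            for role in EVENT_SCHEMA[event_type]}

if __name__ == "__main__":
    print(build_prompt("Conflict.Attack"))
    # Entities assumed to come from the visual grounding stage (VEG).
    detected = {"Attacker": "soldier", "Place": "street"}
    print(fill_template("Conflict.Attack", detected))
```

Keeping unfilled roles as explicit `None` values, rather than dropping them, mirrors how template filling preserves the full relational structure of the event even when some arguments are not grounded in the image.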