Multi-Grained Gradual Inference Model for Multimedia Event Extraction

Yang Liu, Fang Liu, Licheng Jiao, Qianyue Bao, Long Sun, Shuo Li, Lingling Li, Xu Liu

Published: 2024, Last Modified: 25 Mar 2026, IEEE Trans. Circuits Syst. Video Technol. 2024, CC BY-SA 4.0

Abstract: With the development of multimedia technology, events are increasingly presented in multimedia form, and multimedia event extraction (MEE) has become correspondingly important. Existing MEE methods typically rely on simple strategies to align the textual and visual modalities, which makes it difficult to extract events and arguments precisely from complex multimedia documents. To address this problem, we propose a novel Multi-grained Gradual Inference Model (MGIM) that infers and interprets events in complex multimedia structures in a coarse-to-fine manner. To integrate the textual and visual modalities efficiently, we design a Coarse-grained Alignment (CA) module, which represents the two modalities in a graph structure and performs coarse-grained alignment. Building on the CA module, we further propose a Fine-grained Inference (FI) module that aligns text and image at a fine granularity through multiple rounds of gradual inference. MGIM thus provides a comprehensive interpretation of multimedia events at two information granularities (coarse and fine). Extensive experiments on the M2E2 dataset demonstrate the effectiveness of MGIM.
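The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' MGIM implementation: the abstract does not specify the CA or FI modules in detail, so the sketch assumes that text and image graph nodes are already embedded as vectors, uses cosine similarity as a stand-in for coarse-grained alignment, and uses repeated attention-weighted updates as a stand-in for multi-round gradual inference. All function names, the residual mixing weight, and the temperature are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def coarse_align(text_nodes, image_nodes):
    # Assumed stand-in for coarse alignment: cosine similarity
    # between every text node and every image node, shape (T, V).
    return l2_normalize(text_nodes) @ l2_normalize(image_nodes).T

def gradual_inference(text_nodes, image_nodes, rounds=3, temperature=0.1):
    # Assumed stand-in for fine-grained inference: in each round,
    # every text node attends over image nodes and is mixed with
    # the attended visual context (hypothetical 50/50 residual mix).
    t = text_nodes.copy()
    for _ in range(rounds):
        attn = softmax(coarse_align(t, image_nodes) / temperature, axis=-1)
        t = 0.5 * t + 0.5 * (attn @ image_nodes)
    # Final text-to-image alignment after all rounds, shape (T, V).
    attn = softmax(coarse_align(t, image_nodes) / temperature, axis=-1)
    return t, attn

# Toy data: 4 text nodes and 6 image nodes with 16-dim embeddings.
rng = np.random.default_rng(0)
text_nodes = rng.normal(size=(4, 16))
image_nodes = rng.normal(size=(6, 16))

coarse = coarse_align(text_nodes, image_nodes)
refined, attn = gradual_inference(text_nodes, image_nodes)
print(coarse.shape, refined.shape, attn.shape)
```

In this toy setup, `coarse` gives a single-pass similarity matrix, while `attn` reflects the alignment after the text nodes have been refined over several rounds. The graph edges that the CA module is described as using are omitted here for brevity.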