M3S: Scene Graph Driven Multi-Granularity Multi-Task Learning for Multi-Modal NER

Jie Wang, Yan Yang, Keyu Liu, Zhiping Zhu, Xiaorong Liu

2023 (modified: 23 Apr 2023)IEEE ACM Trans. Audio Speech Lang. Process. 2023Readers: Everyone

Abstract: Multi-modal Named Entity Recognition (MNER), which mainly focuses on enhancing text-only NER with visual information, has recently attracted considerable attention. Most current MNER models have made significant progress by jointly understanding visual and language modalities through layers of cross-modality attention. However, these approaches largely ignore the visual bias brought by the image contents and barely consider exploiting the multi-granularity representations and the interactions between visual objects, which are essential in recognizing ambiguous entities. In this paper, we propose a <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">S cene graph driven <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">M ulti-modal <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">M ulti-granularity <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">M ulti-task learning ( <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">M3S ) framework to better exploit visual and textual information in MNER. Specifically, to explicitly alleviate visual bias, we present a novel multi-task approach by employing the task of Named Entity Segmentation (NES) cascade with Named Entity Categorization (NEC). To obtain detailed visual semantics by explicitly modeling objects and relationships between paired objects, we construct scene graphs as a structured representation of the visual contents. Furthermore, a well-designed Multi-granularity Gated Aggregation (MGA) mechanism is introduced to capture inter-modality interactions and extract critical features for named entity recognition. Extensive experiments on two real public datasets demonstrate the effectiveness of our proposed M3S.

0 Replies