MCG-MNER: A Multi-Granularity Cross-Modality Generative Framework for Multimodal NER with Instruction

Published: 01 Jan 2023 · Last Modified: 14 Dec 2023 · ACM Multimedia 2023
Abstract: Multimodal named entity recognition (MNER) is an essential vision-and-language task that aims to locate named entities and classify them into predefined categories with the help of visual context. However, existing MNER studies often suffer from bias in fine-grained visual cue fusion, which may introduce noisy coarse-grained visual cues into MNER. To accurately capture text-image relations and better refine multimodal representations, we propose a novel instruction-based Multi-granularity Cross-modality Generative framework for MNER, namely MCG-MNER. Concretely, we introduce multi-granularity relation propagation to infer visual cues relevant to the text. We then propose a method to inject multi-granularity visual information into cross-modality interaction and fusion, learning a unified representation. Finally, we integrate task-specific instructions and answers into MCG-MNER. Comprehensive experimental results on three benchmark datasets (Twitter2015, Twitter2017, and WikiDiverse) demonstrate the superiority of our proposed method over several state-of-the-art MNER methods. We will publicly release our code for future studies.