Generative Multimodal Data Augmentation for Low-Resource Multimodal Named Entity Recognition

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Oral · CC BY 4.0
Abstract: As an important task in multimodal information extraction, Multimodal Named Entity Recognition (MNER) has recently attracted considerable attention. One key challenge of MNER lies in the lack of sufficient fine-grained annotated data, especially in low-resource scenarios. Although data augmentation is a widely used technique to tackle the above issue, it is challenging to simultaneously generate synthetic text-image pairs and their corresponding high-quality entity annotations. In this work, we propose a novel Generative Multimodal Data Augmentation (GMDA) framework for MNER, which contains two stages: Multimodal Text Generation and Multimodal Image Generation. Specifically, we first transform each annotated sentence into a linearized labeled sequence, and then train a Label-aware Multimodal Large Language Model (LMLLM) to generate the labeled sequence based on a label-aware prompt and its associated image. After using the trained LMLLM to generate synthetic labeled sentences, we further employ a Stable Diffusion model to generate the synthetic images that are semantically related to these sentences. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed GMDA framework, which consistently boosts the performance of several competitive methods for two subtasks of MNER in both full-supervision and low-resource settings.
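The linearization step in the abstract (transforming an annotated sentence into a labeled sequence that a generative model can emit) can be sketched as follows. This is an illustrative assumption, not the paper's exact format: the bracket markup, the `linearize` helper, and the entity types are all hypothetical.

```python
# Minimal sketch (assumed format, not the authors' code): convert BIO-style
# entity annotations into a linearized labeled sequence for generative training.

def linearize(tokens, labels):
    """Turn token/BIO-label pairs into a bracketed labeled sequence,
    e.g. '[ Justin Bieber | PER ] visits [ Paris | LOC ]'."""
    out, i = [], 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]          # entity type, e.g. "PER"
            span = [tokens[i]]
            i += 1
            # Collect the continuation tokens of the same entity.
            while i < len(tokens) and labels[i] == f"I-{etype}":
                span.append(tokens[i])
                i += 1
            out.append(f"[ {' '.join(span)} | {etype} ]")
        else:                              # non-entity token, copied verbatim
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = ["Justin", "Bieber", "visits", "Paris"]
labels = ["B-PER", "I-PER", "O", "B-LOC"]
print(linearize(tokens, labels))
# [ Justin Bieber | PER ] visits [ Paris | LOC ]
```

A sequence in this form can serve both as a training target for the label-aware generator and, after inverse parsing, as a synthetic annotated sentence.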
Primary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our work contributes to multimedia/multimodal processing as follows. We propose a novel Generative Multimodal Data Augmentation (GMDA) framework for Multimodal Named Entity Recognition (MNER), which addresses the following challenges: 1. Existing MNER approaches rely heavily on annotated data. In real applications, such human annotation is often time-consuming and costly to obtain, which limits the effectiveness of existing MNER models in many low-resource scenarios. 2. Data augmentation (DA) is a widely used technique for addressing data sparsity. However, compared with text DA methods for NER, it is more challenging to simultaneously generate synthetic text-image pairs and their corresponding high-quality entity annotations for MNER. Under the GMDA framework, we devise a Label-aware Multimodal Large Language Model (LMLLM) to generate synthetic labeled sentences, and then employ a latent diffusion model to generate a synthetic image for each labeled sentence, yielding a large number of text-image pairs with fine-grained entity annotations for MNER. Experimental results on three benchmark datasets demonstrate that GMDA consistently boosts the performance of several competitive methods on two subtasks of MNER in both full-supervision and low-resource settings.
Supplementary Material: zip
Submission Number: 4934