Self-Adaptive Fine-grained Multi-modal Data Augmentation for Semi-supervised Multi-modal Coreference Resolution

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Coreference resolution, an essential task in natural language processing, is particularly challenging in multi-modal scenarios, where data comes in various forms and modalities. Despite recent advances, two limitations persist: labeled data is scarce, and unlabeled data remains underexploited. We address these issues with a self-adaptive fine-grained multi-modal data augmentation framework for semi-supervised multi-modal coreference resolution (MCR), which both enriches the training data derived from labeled datasets and taps the unexploited potential of unlabeled data. For the former issue, we first leverage text coreference resolution datasets and diffusion models to perform fine-grained text-to-image generation, aligning text entities with image bounding boxes. We then introduce a self-adaptive selection strategy that carefully curates the augmented data, increasing the diversity and volume of the training set without compromising its quality. For the latter issue, we design a self-adaptive threshold strategy that dynamically adjusts the confidence threshold according to the model's learning status and performance, enabling effective use of the valuable information in unlabeled data. We further incorporate a distance smoothing term that smooths the distances between positive and negative samples, strengthening the discriminative power of the model's feature representations and mitigating noise and uncertainty in the unlabeled data. Experiments on the widely used CIN dataset show that our framework outperforms state-of-the-art baselines by at least 9.57\% in MUC F1 score and 4.92\% in CoNLL F1 score; against weakly-supervised baselines, it achieves a 22.24\% improvement in MUC F1 score. These results, supported by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MCR. Our code: https://anonymous.4open.science/r/SLUDA.
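To make the two self-adaptive mechanisms in the abstract concrete, the snippet below gives a minimal, hypothetical PyTorch sketch. It assumes an exponential-moving-average update for the confidence threshold and a softplus relaxation of a hinge margin as the distance smoothing term; the class, function, and hyperparameter names (`SelfAdaptiveThreshold`, `smoothed_margin_loss`, `momentum`, `beta`) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


class SelfAdaptiveThreshold:
    """EMA-based confidence threshold tracking the model's learning status (assumed form)."""

    def __init__(self, init_tau: float = 0.7, momentum: float = 0.99):
        self.tau = init_tau
        self.momentum = momentum

    def update(self, confidences: torch.Tensor) -> float:
        # Move the threshold toward the current batch's mean confidence:
        # early in training more unlabeled pairs are admitted, and the bar
        # tightens as the model's predictions grow more confident.
        batch_conf = confidences.mean().item()
        self.tau = self.momentum * self.tau + (1 - self.momentum) * batch_conf
        return self.tau


def smoothed_margin_loss(pos_dist: torch.Tensor,
                         neg_dist: torch.Tensor,
                         margin: float = 1.0,
                         beta: float = 0.1) -> torch.Tensor:
    # Distance smoothing (assumed form): beta * softplus(x / beta) is a smooth
    # approximation of the hard hinge max(0, x), so gradients near the margin
    # boundary are damped, reducing the impact of noisy pseudo-labeled pairs.
    raw = pos_dist - neg_dist + margin
    return (beta * F.softplus(raw / beta)).mean()


# Hypothetical usage on a batch of unlabeled mention pairs:
#   conf = model.pair_confidence(batch)          # (N,) pairwise confidences
#   tau = thresholder.update(conf)
#   mask = conf > tau                            # keep only confident pseudo-labels
#   loss = smoothed_margin_loss(pos_d[mask], neg_d[mask])
```

The softplus relaxation is one standard way to realize a "smoothed" margin; as `beta` shrinks it recovers the usual hinge loss, so the smoothing strength is tunable against the noise level of the pseudo-labels.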
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Multi-modal coreference resolution (MCR) is a crucial task in the multimedia field, with wide applications in semantic understanding, information retrieval, and intelligent recommendation. The self-adaptive augmentation and selection techniques proposed in this paper directly address the scarcity of labeled data in multi-modal processing: by generating and selecting high-quality MCR data, we enrich the training set and overcome the limitations imposed by the small number of labeled samples. The self-adaptive threshold strategy balances the quality and quantity of the unlabeled data used for training, while the distance smoothing term strengthens the learned feature representations. Experiments on the CIN dataset show that our SLUDA framework achieves state-of-the-art performance, significantly outperforming the best baseline. In sum, our work contributes to enriching and efficiently utilizing multi-modal data and to advancing the application of multi-modal coreference resolution in multimedia, which is important for improving multimedia information understanding.
Supplementary Material: zip
Submission Number: 1993