Abstract: The Multimodal Abstractive Summarization (MAS) task aims to generate a concise summary from given multimodal data (textual and visual). Existing research still relies on simple splicing and blending of information from multiple modalities, without modeling the interaction between an image and its corresponding text or the contextual structural relationship between images and text. We argue that such models cannot fully integrate multimodal information or exploit the Transformer's ability to process sequential data. To this end, for the MAS task, we fuse each image with an image caption that is highly correlated with it, design an image-text alignment task to make the visual modality more effective when embedded into the text summarization task, and propose a sequentially structured image-text fusion method to strengthen the model's semantic understanding of sequences. Through these methods, the visual modality contributes fully to the summarization task, enhancing the MAS model and yielding more accurate summaries. Experiments on a related dataset show that ROUGE-1, ROUGE-2, and ROUGE-L improve by 1.34, 1.64, and 1.32 over the baseline model. In addition, we contribute a large-scale, sequentially structured multimodal abstractive summarization dataset.
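To make the idea of sequentially structured image-text fusion concrete, the following is a minimal sketch (not the authors' code) of one way such an input could be built for a Transformer encoder: each image feature is fused with its caption embedding and inserted at its original position among the text segments, so document order is preserved. The class name SequentialImageTextFusion, the 2048-dimensional image feature, and the simple additive fusion are illustrative assumptions; the paper's actual fusion and alignment objectives are not reproduced here. PyTorch is assumed.

# Minimal sketch, assuming PyTorch; names and dimensions are illustrative,
# not the paper's implementation.
import torch
import torch.nn as nn

class SequentialImageTextFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, img_feat_dim=2048):
        super().__init__()
        # Project image features (assumed CNN output size) and caption embeddings
        # into the shared model dimension.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.cap_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def fuse_image(self, img_feat, cap_emb):
        # Fuse an image with its caption via a simple sum; the paper's fusion
        # is richer -- this only illustrates caption-guided image representation.
        return self.img_proj(img_feat) + self.cap_proj(cap_emb)

    def forward(self, segments):
        # `segments` is a document-ordered list of
        #   ("text", token_embeddings)  with shape (len, d_model), or
        #   ("image", (img_feat, cap_emb)).
        # Order is kept so the encoder sees the contextual structure of
        # interleaved text and images as one sequence.
        seq = []
        for kind, payload in segments:
            if kind == "text":
                seq.append(payload)
            else:
                img_feat, cap_emb = payload
                seq.append(self.fuse_image(img_feat, cap_emb).unsqueeze(0))
        x = torch.cat(seq, dim=0).unsqueeze(0)  # (1, total_len, d_model)
        return self.encoder(x)

# Usage with random tensors standing in for real embeddings.
d = 256
model = SequentialImageTextFusion(d_model=d)
doc = [
    ("text", torch.randn(12, d)),                    # token embeddings of a sentence
    ("image", (torch.randn(2048), torch.randn(d))),  # image feature + caption embedding
    ("text", torch.randn(8, d)),
]
out = model(doc)
print(out.shape)  # torch.Size([1, 21, 256])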
External IDs: dblp:conf/nlpcc/HeQWW24