ViL-Sum: Enhancing Vision and Language Representations via Multi-task Learning for Multi-modal Summarization

Anonymous

16 Jan 2022 (modified: 05 May 2023), ACL ARR 2022 January Blind Submission
Abstract: With the advance of multimedia on the Internet, multi-modal summarization has drawn much attention. Most current methods follow a pipeline strategy in which an off-the-shelf object detector extracts visual features that are then fused with language representations for the decoder to generate the summary. However, these methods suffer from two issues: 1) the separately learned vision and language representations fail to capture the interrelations between the two modalities; and 2) at the local level, the semantic alignments between images and paragraphs are missing. To address these problems, we propose a novel Vision-Language Summarization (ViL-Sum) model with a multi-task learning framework. Specifically, we train our model with two auxiliary tasks, image selection and image reordering, so that the interrelations between images and text are well captured. To further enhance the vision-language representation, we employ a unified transformer-based encoder-decoder structure: the encoder takes images and text as input simultaneously and jointly learns representations of both, which the decoder then uses to generate the summary. Experimental results show that ViL-Sum significantly outperforms current state-of-the-art methods. In further analysis, we find that the representations enhanced by multi-task training and joint modeling capture reasonable relations between images and text.
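
The joint-encoding and multi-task idea described in the abstract can be sketched as follows. This is a minimal illustration only: it assumes a standard PyTorch encoder-decoder, and all module names, feature dimensions, loss weights, and label formats are hypothetical assumptions, not the authors' implementation (positional encodings and decoder masking are also omitted for brevity).

```python
import torch
import torch.nn as nn

class ViLSumSketch(nn.Module):
    """Hypothetical sketch: one encoder over concatenated image + text tokens,
    a decoder for summary generation, and two auxiliary heads on the image states."""
    def __init__(self, d_model=512, vocab_size=32000, n_images=8, visual_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(visual_dim, d_model)   # project visual features into the text space
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)    # main task: summary generation
        self.select_head = nn.Linear(d_model, 1)         # auxiliary task 1: image selection (keep / drop)
        self.order_head = nn.Linear(d_model, n_images)   # auxiliary task 2: image reordering (target position)

    def forward(self, text_ids, image_feats, dec_ids):
        # Joint encoding: image and text tokens share one sequence so the
        # encoder models cross-modal interactions directly.
        tokens = torch.cat([self.image_proj(image_feats), self.text_embed(text_ids)], dim=1)
        memory = self.encoder(tokens)
        img_states = memory[:, : image_feats.size(1)]    # encoder states of the image tokens
        dec_out = self.decoder(self.text_embed(dec_ids), memory)
        return (self.lm_head(dec_out),
                self.select_head(img_states).squeeze(-1),
                self.order_head(img_states))

def multitask_loss(model, batch, w_select=1.0, w_order=1.0):
    # Multi-task training: weighted sum of generation, selection, and reordering losses.
    gen_logits, sel_logits, ord_logits = model(batch["text_ids"], batch["image_feats"], batch["dec_ids"])
    gen = nn.functional.cross_entropy(gen_logits.transpose(1, 2), batch["summary_ids"])
    sel = nn.functional.binary_cross_entropy_with_logits(sel_logits, batch["select_labels"])
    order = nn.functional.cross_entropy(ord_logits.transpose(1, 2), batch["order_labels"])
    return gen + w_select * sel + w_order * order
```

The point of the sketch is only that the two auxiliary objectives share the same encoder states as the summarization task, which is how multi-task training can push the joint representation to capture image-text relations.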
Paper Type: long