Abstract: Multimodal abstractive summarization for videos is an emerging task that aims to generate a summary from multi-source information (i.e., video, audio transcript). The challenge is how to merge multimodal long sequences to capture rich semantic information without allowing possible noise from either lengthy modal sequence to degrade the other modality and thus hurt the entire model. To address the issues, we propose a <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">m</b> ultistage <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">f</b> usion network with <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">f</b> orget <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">g</b> ate (MFFG), which selectively integrates multi-source information through the cross-fusion in encoding and hierarchical fusion in decoding between modalities, and design a fusion forget gate module to suppress the potential multimodal noise flow of multi-source long sequence. Meanwhile, considering that the source text in this task is lengthy and has the same distribution as the output summary text, we inherit the partial structure of the MFFG model and again propose its variant, single-stage fusion network with forget gate (SFFG), which simplifies the fusion schema, and leverages the long source text to enhance the representation of the target summary. Experimental results on How2 dataset and How2-300 dataset demonstrate the superiority of the two multimodal fusion methods. Further, we provide a version of ASR transcription data of How2 dataset to evaluate model performance under noisy scenarios, and experimental results show obvious advantages of our proposed models over prior systems.
0 Replies
Loading