Abstract: Multimodal summarization aims to use multimodal data to generate accurate and concise summaries for long sentences. While previous work has achieved promising results, it overlooks the semantic mismatch among modalities and lacks subject-information guidance for adaptively referencing images. Motivated by this observation, we propose ASSM, an Adaptive Subject-focused modeling approach for multimodal summarization via Semantic Matching. The novelty of ASSM lies in two aspects. First, we propose a multimodal semantic matching module that projects the multimodal inputs into a shared joint embedding space to determine whether the semantics of the different modalities match. Second, we propose an adaptive subject-focused guide module that adaptively references images to learn subject tokens based on the semantic matching results. With these subject tokens, the model focuses on the subject information, providing precise guidance for summary generation. We conduct extensive experiments on two standard benchmarks and compare ASSM with 17 existing models. The results in terms of ROUGE, BERTScore, and MoverScore show that ASSM outperforms all competitors, achieving state-of-the-art performance and demonstrating the effectiveness of our proposal. In addition, we provide a case study to further illustrate the usability of ASSM.
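To make the two components concrete, the following is a minimal sketch (not the authors' released code) of the ideas the abstract names: projecting text and image features into a shared joint embedding space to score semantic matching, and adaptively gating image guidance by that score when learning subject tokens. All module names, dimensions, the cosine-based match score, and the attention-based subject-token readout are illustrative assumptions.

```python
# Illustrative sketch only; module names, dimensions, and formulas are assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticMatcher(nn.Module):
    """Projects pooled text and image features into a joint space and scores their match."""

    def __init__(self, text_dim: int, image_dim: int, joint_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.image_proj = nn.Linear(image_dim, joint_dim)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Cosine similarity in the joint space, mapped to [0, 1] as a match score.
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.image_proj(image_feat), dim=-1)
        return (F.cosine_similarity(t, v, dim=-1) + 1.0) / 2.0


class SubjectGuide(nn.Module):
    """Gates image tokens by the match score and derives subject tokens via attention."""

    def __init__(self, joint_dim: int = 256, num_subject_tokens: int = 4):
        super().__init__()
        self.subject_queries = nn.Parameter(torch.randn(num_subject_tokens, joint_dim))
        self.attn = nn.MultiheadAttention(joint_dim, num_heads=4, batch_first=True)

    def forward(self, text_tokens, image_tokens, match_score):
        # Down-weight mismatched images so they contribute little to the subject tokens.
        gated_image = image_tokens * match_score.view(-1, 1, 1)
        context = torch.cat([text_tokens, gated_image], dim=1)
        queries = self.subject_queries.unsqueeze(0).expand(context.size(0), -1, -1)
        subject_tokens, _ = self.attn(queries, context, context)
        return subject_tokens  # would serve as guidance for the summary decoder


# Usage with toy shapes: pooled features for matching, token sequences for guidance.
matcher = SemanticMatcher(text_dim=768, image_dim=512)
guide = SubjectGuide(joint_dim=256)
text_feat, image_feat = torch.randn(2, 768), torch.randn(2, 512)
text_tokens, image_tokens = torch.randn(2, 20, 256), torch.randn(2, 49, 256)
score = matcher(text_feat, image_feat)          # (2,) match scores in [0, 1]
subjects = guide(text_tokens, image_tokens, score)  # (2, 4, 256) subject tokens
```

The gating step reflects the adaptive behavior described in the abstract: when the image semantically matches the text, its tokens are passed through nearly unchanged; when they mismatch, the image's influence on the learned subject tokens is suppressed.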