Exploring the Trade-Off within Visual Information for MultiModal Sentence Summarization

Published: 01 Jan 2024, Last Modified: 22 Jul 2025 · SIGIR 2024 · CC BY-SA 4.0
Abstract: MultiModal Sentence Summarization (MMSS) aims to generate a brief summary from a given source sentence and its associated image. Previous studies on MMSS have achieved success either by selecting task-relevant visual information or by filtering out task-irrelevant visual information to help the textual modality generate the summary. However, enhancing from a single perspective usually introduces over-preservation or over-compression problems. To tackle these issues, we resort to the Information Bottleneck (IB) principle, which seeks a maximally compressed mapping of the input that preserves as much information about the target as possible. Specifically, we propose a novel method, T³, which adopts IB to balance the Trade-off between Task-relevant and Task-irrelevant visual information through a variational inference framework. In this way, task-irrelevant visual information is compressed to the utmost while task-relevant visual information is maximally retained. From this holistic perspective, the generated summary can maintain as many key elements as possible while discarding unnecessary ones. Extensive experiments on the representative MMSS dataset demonstrate the superiority of our proposed method. Our code is available at https://github.com/YuanMinghuan/T3.
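For context, the Information Bottleneck principle invoked above is usually stated as finding a stochastic encoding $Z$ of an input $X$ that is maximally compressed yet maximally informative about a target $Y$. A textbook formulation (due to Tishby et al.), not necessarily the exact objective optimized in T³, is:

$$\min_{p(z \mid x)} \; I(Z; X) \;-\; \beta\, I(Z; Y),$$

where $I(\cdot\,;\cdot)$ denotes mutual information and $\beta > 0$ weights prediction against compression. In the MMSS setting described here, $X$ would correspond to the visual input and $Y$ to the target summary, so minimizing $I(Z; X)$ discards task-irrelevant visual information while maximizing $I(Z; Y)$ retains the task-relevant portion.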