Exploring the Trade-Off within Visual Information for MultiModal Sentence Summarization

Published: 01 Jan 2024, Last Modified: 22 Jul 2025 · SIGIR 2024 · CC BY-SA 4.0
Abstract: MultiModal Sentence Summarization (MMSS) aims to generate a brief summary from a given source sentence and its associated image. Previous studies on MMSS have achieved success either by selecting task-relevant visual information or by filtering out task-irrelevant visual information to help the textual modality generate the summary. However, enhancing from a single perspective usually introduces over-preservation or over-compression problems. To tackle these issues, we resort to the Information Bottleneck (IB) principle, which seeks a maximally compressed mapping of the input that preserves as much information about the target as possible. Specifically, we propose a novel method, T³, which adopts IB to balance the Trade-off between Task-relevant and Task-irrelevant visual information through a variational inference framework. In this way, task-irrelevant visual information is compressed to the utmost while task-relevant visual information is maximally retained. From this holistic perspective, the generated summary can maintain as many key elements as possible while discarding unnecessary ones. Extensive experiments on the representative MMSS dataset demonstrate the superiority of our proposed method. Our code is available at https://github.com/YuanMinghuan/T3.
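For context, the Information Bottleneck principle invoked above is usually stated as finding a stochastic encoding $Z$ of an input $X$ that is maximally compressed yet maximally informative about a target $Y$. A textbook formulation (due to Tishby et al.), not necessarily the exact objective optimized in T³, is:

$$\min_{p(z \mid x)} \; I(Z; X) \;-\; \beta\, I(Z; Y),$$

where $I(\cdot\,;\cdot)$ denotes mutual information and $\beta > 0$ weights prediction against compression. In the MMSS setting described here, $X$ would correspond to the visual input and $Y$ to the target summary, so minimizing $I(Z; X)$ discards task-irrelevant visual information while maximizing $I(Z; Y)$ retains the task-relevant portion.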