Improving Video Summarization by Exploring the Coherence Between Corresponding Captions

Published: 2025 · Last Modified: 23 Jan 2026 · IEEE Trans. Image Process. 2025 · CC BY-SA 4.0
Abstract: Video summarization aims to generate a compact summary of the original video by selecting and combining its most representative parts. Most existing approaches focus only on recognizing key video segments, without considering the summary as a whole: transitions between selected segments are often abrupt and inconsistent, making the summary confusing. The coherence of a video summary is thus crucial to its quality and the user's viewing experience, yet coherence between video segments is hard to measure and optimize from a purely visual perspective. To this end, we propose a Language-guided Segment Coherence-Aware Network (LS-CAN), which integrates overall coherence considerations into key segment recognition. The main idea of LS-CAN is to exploit the coherence of the corresponding text modality to promote the overall coherence of the video summary, leveraging a natural property of language: contextual coherence is easy to measure. To measure text coherence, we propose a multi-graph correlated neural network module (MGCNN), which constructs a graph for each sentence from three key components, i.e., subject, attribute, and action words. For each sentence pair, node features are learned discriminatively by incorporating both the neighbors within a sentence's own graph and information from its dual graph, reducing errors caused by synonyms or reference relationships when measuring inter-sentence correlation, as well as errors caused by treating each component in isolation. In this way, MGCNN measures text coherence through subject agreement, attribute coherence, and action succession. Moreover, with the help of large language models, we augment the original text-coherence annotations, improving MGCNN's ability to judge coherence. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and of each proposed module; in particular, it improves the previous best results on the BLiSS dataset by +3.8%, +14.2%, and +12% in F1 score, $\tau$, and $\rho$, respectively.
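To make the dual-graph idea concrete, below is a minimal, illustrative sketch of the cross-graph message passing the abstract describes, not the authors' implementation: each sentence is reduced to three nodes (subject, attribute, action), each node mixes context from its own graph with an attention-weighted summary of the paired sentence's (dual) graph, and the per-component similarities are averaged into a coherence score. The toy random embeddings, the single update round, and the scoring rule are all assumptions for demonstration.

```python
# Illustrative sketch of dual-graph coherence scoring (NOT the paper's MGCNN).
import numpy as np

rng = np.random.default_rng(0)

def embed(words):
    """Hypothetical stand-in for real word embeddings (e.g., GloVe or BERT)."""
    return np.stack([rng.standard_normal(16) for _ in words])

def message_pass(nodes, dual_nodes):
    """One update round: each node mixes (i) the mean of its own graph's
    nodes and (ii) an attention-weighted summary of the dual graph."""
    own = nodes.mean(axis=0, keepdims=True)          # intra-graph context
    att = nodes @ dual_nodes.T                       # cross-graph affinities
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)            # row-wise softmax
    cross = att @ dual_nodes                         # dual-graph context
    return nodes + own + cross

def coherence(sent_a, sent_b):
    """Score two sentences, each given as (subject, attribute, action) words,
    by averaging component-wise cosine similarities after message passing."""
    g_a, g_b = embed(sent_a), embed(sent_b)
    g_a2 = message_pass(g_a, g_b)
    g_b2 = message_pass(g_b, g_a)
    sims = [
        float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in zip(g_a2, g_b2)  # subject/attribute/action pairs
    ]
    return sum(sims) / len(sims)

# Toy usage: higher scores suggest the second sentence follows coherently.
print(coherence(("chef", "young", "chops"), ("he", "skilled", "stirs")))
```

Letting each node attend over the whole dual graph, rather than comparing subject to subject and action to action in isolation, is what allows synonym and reference relations (e.g., "chef" vs. "he") to contribute to the score.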