Video Summarization Pretraining with Self-Discovery of Informative Frames

ICLR 2026 Conference Submission 15949 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: video summarization
TL;DR: The paper provides a novel approach to video summarization pretraining using information bottleneck.
Abstract: The rapid proliferation of videos makes automated video summarization (VS) an essential research problem: "Which abridged video best conveys the whole story?" The limited size of existing datasets is known to constrain the generalization of advanced VS methods, calling for pretraining techniques that capitalize on unlabeled videos. Several pretraining methods for VS have been proposed, yet they rely heavily on fixed pseudo-summaries and often fail to capture the diverse notions of frame importance, resulting in narrow generalization. To resolve the conflict between pseudo-summaries and downstream tasks, our idea is as follows: pretraining should enable the summarizer to learn to distinguish more meaningful summaries directly from unlabeled videos, without committing to any single perspective; fine-tuning then only needs to adapt the pretrained, multifaceted importance to the downstream perspective, facilitating supervised learning. Our pretraining approach, named ViSP, is free of pseudo-summaries and is thus expected to better align with the ill-posed nature of defining keyframes. The pretrained model can be fine-tuned into state-of-the-art summarizers by leveraging the learned knowledge of frame saliency. ViSP is conceptually simple and empirically powerful, and it can be used to pretrain any neural video summarizer. Extensive experiments on two benchmark datasets (SumMe and TVSum) demonstrate the superiority of our approach.
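The abstract and TL;DR mention information-bottleneck-style pretraining without pseudo-summaries but give no implementation details here. The sketch below is only a hypothetical illustration, assuming PyTorch: the `FrameScorer` module, the `ib_pretrain_loss` objective, its retention/compression terms, and the `beta` weight are all invented for illustration and are not ViSP's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameScorer(nn.Module):
    """Placeholder frame-importance scorer; any neural video summarizer could sit here."""

    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, frames):
        # frames: (B, T, feat_dim) -> per-frame scores in [0, 1], shape (B, T)
        return torch.sigmoid(self.net(frames)).squeeze(-1)


def ib_pretrain_loss(frames, scores, beta=0.1):
    """Illustrative information-bottleneck-style objective on unlabeled videos:
    - retention: the score-weighted frame average should stay close to the
      unweighted video-level representation (keep informative content);
    - compression: penalize the average score, pushing toward a compact subset.
    """
    weights = scores / (scores.sum(dim=1, keepdim=True) + 1e-8)       # (B, T)
    summary_repr = (weights.unsqueeze(-1) * frames).sum(dim=1)        # (B, D)
    video_repr = frames.mean(dim=1)                                   # (B, D)
    retention = 1.0 - F.cosine_similarity(summary_repr, video_repr, dim=-1).mean()
    compression = scores.mean()
    return retention + beta * compression


if __name__ == "__main__":
    # Toy pretraining step on random stand-ins for precomputed frame features;
    # note that no pseudo-summary labels are involved.
    model = FrameScorer()
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    frames = torch.randn(4, 120, 1024)
    loss = ib_pretrain_loss(frames, model(frames))
    loss.backward()
    optim.step()
    print(f"pretraining loss: {loss.item():.4f}")
```

After such a pretraining stage, the scorer would be fine-tuned with supervised importance labels on SumMe or TVSum; the specific objective above is a guess at the general recipe, not the paper's method.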
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15949