Unsupervised Video Summarization Based on the Diffusion Model of Feature Fusion

Published: 2024, Last Modified: 14 Nov 2024 · IEEE Trans. Comput. Soc. Syst. 2024 · CC BY-SA 4.0
Abstract: Video summarization (VS) technologies can automatically extract key frames carrying effective information and thus help to quickly identify events or speed up decision-making, especially for accidents. With the fast development of deep learning, many generative adversarial network (GAN)- and reinforcement learning (RL)-based unsupervised VS methods have been developed in recent years. However, these methods suffer from unstable training and the difficulty of reward function formulation, respectively. To this end, we present an unsupervised VS method called diffusion model of feature fusion (DMFF) in this article, which consists of a diffusion module (DM), a feature extraction and compression module (FECM), and a coarse-fine frame selector (CFFS). The DM is designed to avoid the training instability caused by GAN's alternating training of generator and discriminator. The FECM extracts and compresses video features. The CFFS captures both low-level and high-level features between frames to handle complex and diverse accident videos. Then, high-level local and global features are fused to generate a multigrained final frame score. Experiments on two widely used benchmark datasets, SumMe and TVSum, demonstrate the effectiveness and superiority of the proposed network over state-of-the-art methods, with more stable training.
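The abstract's final stage, fusing local and global frame scores into a multigrained score and selecting key frames, can be sketched minimally as below. This is an illustrative assumption: the paper's exact fusion rule is not given in the abstract, so a simple weighted sum with a hypothetical `alpha` weight is used as a stand-in, with top-k selection over the fused scores.

```python
import numpy as np

def fuse_frame_scores(local_scores, global_scores, alpha=0.5):
    """Fuse per-frame local and global importance scores into one
    multigrained score. A weighted sum is an assumption here; the
    DMFF paper's actual fusion mechanism may differ."""
    local = np.asarray(local_scores, dtype=float)
    glob = np.asarray(global_scores, dtype=float)
    return alpha * local + (1.0 - alpha) * glob

def select_keyframes(scores, k):
    """Return indices of the k highest-scoring frames, in temporal order."""
    top = np.argsort(scores)[-k:]          # k largest fused scores
    return np.sort(top).tolist()           # restore temporal order

# Toy example with 6 frames: frames 1 and 3 score highest overall.
local_s = [0.1, 0.9, 0.3, 0.8, 0.2, 0.4]
global_s = [0.2, 0.7, 0.5, 0.9, 0.1, 0.3]
final = fuse_frame_scores(local_s, global_s, alpha=0.5)
print(select_keyframes(final, 2))  # → [1, 3]
```

In a real pipeline the two score streams would come from the coarse (global) and fine (local) branches of the CFFS; here they are hard-coded only to make the fusion step concrete.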