MHSCNet: A Multimodal Hierarchical Shot-Aware Convolutional Network for Video Summarization

Wujiang Xu, Runzhong Wang, Xiaobo Guo, Shaoshuai Li, Qiongxu Ma, Yunan Zhao, Sheng Guo, Zhenfeng Zhu, Junchi Yan

Published: 2023, Last Modified: 30 Sept 2024, ICASSP 2023, CC BY-SA 4.0

Abstract: Video summarization is an essential problem in signal processing that aims to produce a concise summary of the original video. Existing video summarization approaches regard the task as a keyframe selection problem and generally construct the frame-wise representation by combining long-range temporal dependencies with either unimodal or bimodal information. The optimal keyframes should offer a semantic summarization of the whole content by exploiting the multimodal and shot-level hierarchical nature of videos; however, existing methods do not fully exploit these properties. In this paper, we propose to construct a more powerful and robust frame-wise representation and to predict the frame-level importance score in a fair and comprehensive manner. Specifically, we propose a multimodal hierarchical shot-aware convolutional network, denoted MHSCNet, which enhances the frame-wise representation by combining all available multimodal information. We further design a hierarchical ShotConv network that produces an adaptive shot-aware frame-level representation by considering both short-range and long-range temporal dependencies. Based on the learned shot-aware representations, MHSCNet predicts the frame-level importance score from both local and global views of the video. Extensive experiments on two standard video summarization datasets demonstrate that our proposed method consistently outperforms state-of-the-art methods.
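
To make the keyframe-selection framing concrete, the following is a minimal sketch of how frame-level importance scores could be turned into a summary. It is not the MHSCNet model or the authors' selection procedure: the function name `select_keyframes`, the `budget` parameter, and the example scores are all assumptions introduced for illustration, and the scores would in practice come from a trained predictor.

```python
from typing import List


def select_keyframes(scores: List[float], budget: float = 0.15) -> List[int]:
    """Return indices of the highest-scoring frames, in temporal order.

    `budget` is the fraction of frames to keep. It is an assumed value
    chosen for illustration, not one taken from the paper.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    if not scores:
        return []
    # Keep at least one frame so a non-empty video always has a summary.
    k = max(1, int(len(scores) * budget))
    # Rank frame indices by score, highest first; ties favor the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore temporal order so the summary plays back chronologically.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical per-frame importance scores for a 10-frame clip.
    frame_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4, 0.6, 0.15]
    print(select_keyframes(frame_scores, budget=0.3))
```

With the hypothetical scores above and a 30% budget, the three highest-scoring frames are kept and returned in their original order.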