Abstract: Video summarization aims to extract the most important information from a source video while retaining its primary content. In practical applications, unsupervised video summarizers are valued for their flexibility, since they require no annotated data. However, they still need well-defined rules for deciding how essential each frame is before it can be selected into the summary. Unlike conventional frame-based scoring methods, we propose a shot-level unsupervised video summarizer, termed Hybrid Siamese Masked Autoencoders (H-SMAE), that operates from a higher semantic perspective. Specifically, our method consists of Multi-view Siamese Masked Autoencoders (MV-SMAE) and a Shot Diversity Enhancer (SDE). MV-SMAE recovers the masked shots from the original frame features and three unmasked shot subsets using carefully designed Siamese masked autoencoders. Inspired by the masking idea in MAE, MV-SMAE introduces a Siamese architecture that models prior references to guide the reconstruction of the masked shots. In addition, SDE improves the diversity of the generated summary by minimizing a repelling loss among the selected shots. Finally, the outputs of these two modules are fused, and the 0-1 knapsack algorithm is applied to produce the video summary. Experiments on two challenging and diverse datasets demonstrate that our approach outperforms other state-of-the-art unsupervised and weakly-supervised methods, and even achieves results comparable to several strong supervised methods. The source code of H-SMAE is available at https://github.com/wzq0214/H-SMAE.
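To illustrate the diversity objective the abstract attributes to SDE, the following is a minimal sketch of a repelling loss over selected-shot features. The exact formulation used in H-SMAE is not given here, so the mean pairwise cosine similarity below is an assumption, and `shot_feats` is a hypothetical input name.

```python
import torch
import torch.nn.functional as F

def repelling_loss(shot_feats: torch.Tensor) -> torch.Tensor:
    """shot_feats: (n_shots, dim) feature vectors of the selected shots.

    Returns the mean pairwise cosine similarity; minimizing it pushes
    the selected shots apart in feature space, encouraging a more
    diverse summary. Assumes n_shots > 1.
    """
    n = shot_feats.size(0)
    z = F.normalize(shot_feats, dim=1)           # unit-norm rows
    sim = z @ z.t()                              # (n, n) cosine similarities
    off_diag = sim.sum() - sim.diagonal().sum()  # drop self-similarity terms
    return off_diag / (n * (n - 1))              # average over ordered pairs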
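The final selection step the abstract names, the 0-1 knapsack, is standard in video summarization: given per-shot importance scores and shot lengths, choose the subset that maximizes total score under a summary-length budget (benchmark protocols commonly cap the summary at 15% of the video duration). A dynamic-programming sketch follows; the scores would come from the fused MV-SMAE/SDE outputs, which this sketch simply takes as given.

```python
from typing import List

def knapsack_select(scores: List[float], lengths: List[int], budget: int) -> List[int]:
    """Return indices of shots maximizing total score with total length <= budget."""
    n = len(scores)
    dp = [0.0] * (budget + 1)  # dp[j]: best score with total length <= j
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        # Iterate capacities downward so each shot is used at most once.
        for j in range(budget, lengths[i] - 1, -1):
            cand = dp[j - lengths[i]] + scores[i]
            if cand > dp[j]:
                dp[j] = cand
                keep[i][j] = True
    # Backtrack to recover which shots were selected.
    selected, j = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][j]:
            selected.append(i)
            j -= lengths[i]
    return sorted(selected)

# Example: budget = int(0.15 * total_frames), lengths in frames per shot.
```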
DOI: 10.1109/TCSVT.2025.3557254