Abstract: Generic event boundary detection (GEBD), inspired by the human visual cognitive behavior of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing and summarization. In this paper, we demonstrate that state-of-the-art GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We address this challenge by experimentally reexamining the architecture of GEBD models, and we uncover several surprising findings. First, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Second, we find that the common design of GEBD models using image-domain backbones can contain substantial architectural redundancy, motivating us to gradually “modernize” each component to enhance efficiency. Third, we show that GEBD models using image-domain backbones, which conduct spatiotemporal learning in a greedy spatial-then-temporal manner, can suffer from a distraction issue, which may be the main culprit behind their inefficiency. Using a video-domain backbone to jointly conduct spatiotemporal modeling for GEBD is an effective solution to this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, which significantly outperforms the previous SOTA methods, with up to a 1.7% performance gain and a 280% practical speedup under the same backbone choice. Our research prompts the community to design modern GEBD methods with model complexity in mind, particularly for resource-aware applications. The code is available at https://github.com/anonymous.
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: GEBD is a recently proposed and important video understanding task. Its development can immediately support multimedia applications such as video editing and summarization and, more importantly, can spur progress in long-form video. This work prompts a rethinking of the architecture design of GEBD models and proposes a family of models named EfficientGEBD, which achieve SOTA performance with excellent speed. Our research encourages the community to design modern GEBD methods with model complexity in mind, which can potentially benefit the interpretation of long-form multimedia content.
Supplementary Material: zip
Submission Number: 4438