Rethinking the Architecture Design for Efficient Generic Event Boundary Detection

Published: 01 Jan 2024, Last Modified: 10 Nov 2024, Venue: ACM Multimedia 2024, License: CC BY-SA 4.0
Abstract: Generic event boundary detection (GEBD), inspired by the human visual cognitive behavior of segmenting videos into meaningful temporal chunks, finds utility in various applications. This paper demonstrates that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed. We contribute to addressing this challenge by reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD base model already achieves promising performance without any sophisticated design. Secondly, we find that the widely used image-backbone-based GEBD models contain plenty of redundancy, motivating us to "modernize" each component for efficiency. We also show that GEBD models whose image backbones perform greedy spatial-then-temporal feature learning can suffer from a distraction issue, which might be the main cause of inefficiency in GEBD and can be effectively addressed by using a video-domain backbone. The outcome of our exploration, EfficientGEBD, significantly outperforms previous SOTA methods, with up to a 1.7% performance gain and a 280% speedup under the same backbone. The code is available at https://github.com/Ziwei-Zheng/EfficientGEBD.
