Attention-Based Two-Phase Model for Video Action Detection

Xiongtao Chen, Wenmin Wang, Weimian Li, Jinzhuo Wang

Published: 2017, Last Modified: 07 Oct 2025 | CAIP (2) 2017 | CC BY-SA 4.0

Abstract: This paper considers the task of action detection in long untrimmed videos. Existing methods tend to process every frame or fragment of the whole video to make detection decisions, which is both time-consuming and computationally burdensome. Instead, we present an attention-based model that performs action detection by watching only a few fragments, which is independent of the video length and can therefore be applied to real-world videos. Our motivation comes from the observation that humans usually focus their attention sequentially on different frames of a video to quickly narrow down the extent where an action occurs. Our model is a two-phase architecture. In the first phase, a temporal proposal network predicts temporal proposals for multi-category actions: it observes a fixed number of locations in a video to predict action bounds and learns a location transfer policy. In the second phase, a well-trained classifier extracts visual information from the proposals to classify the action and decide whether to adopt each proposal. We evaluate our model on the ActivityNet dataset and show that it significantly outperforms the baseline.
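The two-phase control flow described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names (`propose`, `detect`), the callables `observe`, `policy`, and `classify`, the normalized [0, 1] time axis, the one-proposal-per-glimpse behavior, and the score threshold are all assumptions introduced here. The toy stand-ins at the bottom replace the learned networks with hand-written rules purely so the example runs.

```python
from typing import Callable, List, Tuple

# Assumption: a temporal span is (start, end) as fractions of the video length.
Span = Tuple[float, float]


def _clip(x: float) -> float:
    """Clamp a normalized time position to [0, 1]."""
    return min(max(x, 0.0), 1.0)


def propose(
    observe: Callable[[float], float],
    policy: Callable[[float, float], Tuple[Span, float]],
    num_glimpses: int,
    start_loc: float = 0.5,
) -> List[Span]:
    """Phase 1 (sketch): visit a fixed number of locations in the video.

    At each location the hypothetical `policy` returns a candidate span
    (the predicted action bounds) and the next location to look at
    (the location transfer). The loop length does not depend on video length.
    """
    proposals: List[Span] = []
    loc = start_loc
    for _ in range(num_glimpses):
        feature = observe(loc)
        (start, end), next_loc = policy(feature, loc)
        start, end = sorted((_clip(start), _clip(end)))
        proposals.append((start, end))
        loc = _clip(next_loc)
    return proposals


def detect(
    proposals: List[Span],
    classify: Callable[[Span], Tuple[str, float]],
    threshold: float = 0.5,
) -> List[Tuple[Span, str, float]]:
    """Phase 2 (sketch): classify each proposal and keep the confident ones."""
    detections = []
    for span in proposals:
        label, score = classify(span)
        if score >= threshold:  # decide whether to adopt the proposal
            detections.append((span, label, score))
    return detections


# Toy stand-ins for the learned components (hypothetical, for illustration).
def toy_observe(loc: float) -> float:
    return loc  # a real model would extract visual features at this location


def toy_policy(feature: float, loc: float) -> Tuple[Span, float]:
    # Propose a window around the current location, then jump ahead.
    return (loc - 0.1, loc + 0.1), (loc + 0.3) % 1.0


def toy_classify(span: Span) -> Tuple[str, float]:
    # Pretend an action occurs around t = 0.8 of the video.
    start, end = span
    return ("action", 0.9) if start <= 0.8 <= end else ("background", 0.1)


proposals = propose(toy_observe, toy_policy, num_glimpses=3)
detections = detect(proposals, toy_classify)
```

In this sketch the number of observations is fixed by `num_glimpses`, which mirrors the abstract's point that the work done does not grow with the length of the video.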