Abstract: Action detection is a challenging task because it requires locating actions of interest in both space and time. In this paper, a multi-task CNN model (MTCNN) that employs both spatial and temporal modules is proposed to solve this task. Specifically, the spatial module fuses the appearance and motion information of frames, which helps to regress the action bounding boxes in every frame more accurately, while the temporal module uses a 3D ConvNet, which effectively captures the temporal correlation between frames and thus predicts the time interval of an action more precisely. Moreover, the two modules share information before their final outputs and are trained simultaneously. Experiments on the UCF101-24 and J-HMDB-21 datasets demonstrate that our proposed pipeline outperforms most state-of-the-art methods.