Abstract: Recently, deep learning methods have been extensively applied to action recognition in videos. Most existing deep networks treat every video frame equally and directly assign the video label to all frames sampled from it. However, the discriminative action may occur sparsely in a few key frames, while the remaining frames are less relevant or even irrelevant to the action class; treating all frames equally therefore hurts performance. To address this issue, we propose a temporal attention model that learns to recognize human actions in videos while selectively focusing on the informative frames. Our model needs no explicit annotations of such informative frames during training or testing. Specifically, we adopt a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units and attach higher importance to the frames that are discriminative for the task at hand. Our method consistently improves over no-attention baselines with both RGB and optical-flow based deep ConvNets, and achieves state-of-the-art performance on two challenging datasets, UCF101 and HMDB51.
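The core idea of weighting frames by importance can be sketched with standard soft attention over per-frame features. The following minimal NumPy example is an illustration only, not the paper's model: the scoring vector `w` is a hypothetical stand-in for the scores the paper's LSTM would produce, and `temporal_attention_pool` is a name chosen here for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over frame scores
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_attention_pool(frame_feats, w):
    # frame_feats: (T, D) per-frame ConvNet features
    # w: (D,) hypothetical scoring vector (the paper uses an LSTM instead)
    scores = frame_feats @ w           # one relevance score per frame
    alpha = softmax(scores)            # attention weights over frames, sum to 1
    video_feat = alpha @ frame_feats   # attention-weighted video representation
    return video_feat, alpha

# toy example: 4 frames with 3-dim features
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 3))
w = rng.standard_normal(3)
pooled, alpha = temporal_attention_pool(feats, w)
print(alpha)  # frames with higher scores contribute more to the pooled feature
```

Because the weights are a softmax, frames deemed irrelevant receive near-zero weight rather than being hard-dropped, which keeps the whole pipeline differentiable and trainable end to end without frame-level annotations.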