Deep Local Video Feature for Action Recognition

CVPR Workshops 2017
Abstract: We investigate the problem of representing an entire video using CNN features for human action recognition. End-to-end learning of CNNs/RNNs on whole videos is currently not possible due to GPU memory limitations, so a common practice is to use sampled frames as inputs, with the video-level labels as supervision. However, the global video labels might not be suitable for all of the temporally local samples, as videos often contain content besides the action of interest. We therefore propose to instead treat the deep networks trained on local inputs as local feature extractors. The local features are then aggregated to form global features, which are used to assign video-level labels through a second classification stage. We investigate a number of design choices for this local feature approach. Experimental results on the HMDB51 and UCF101 datasets show that simple maximum pooling over the sparsely sampled local features leads to a significant performance improvement.
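As a rough illustration of the two-stage pipeline the abstract describes, the PyTorch sketch below max-pools per-frame CNN features into a single global feature and feeds it to a second-stage classifier. It is a minimal sketch under stated assumptions: a torchvision ResNet-50 stands in for the paper's own networks (which are fine-tuned on local inputs with video-level labels), and the feature dimension, `num_classes`, and frame count are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Local feature extractor: a CNN applied to individual sampled frames.
# A pretrained torchvision ResNet-50 is used here as a stand-in for the
# paper's trained networks.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()  # drop the ImageNet head, keep 2048-d features
backbone.eval()

# Second-stage classifier operating on the aggregated global feature.
num_classes = 101  # e.g. UCF101 (illustrative)
classifier = nn.Linear(2048, num_classes)

def classify_video(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224) tensor of T sparsely sampled frames."""
    with torch.no_grad():
        local_feats = backbone(frames)       # (T, 2048) local features
    global_feat, _ = local_feats.max(dim=0)  # element-wise max pooling over time
    return classifier(global_feat)           # video-level class scores

# Usage: score a video from 25 sparsely sampled frames (dummy data here).
logits = classify_video(torch.randn(25, 3, 224, 224))
```

The key design choice this reflects is that only the small second-stage classifier sees the whole video: the backbone processes frames independently, so memory cost stays flat in video length, and max pooling lets a few discriminative frames dominate the global feature even when much of the video is unrelated to the action.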